[HN]score: 0.27
DeepSeek 4 Flash local inference engine for Metal
May 7, 2026
ds4.c is a purpose-built Metal inference engine for DeepSeek V4 Flash, a 284B-parameter MoE model. It runs on MacBooks with 128GB of RAM by combining 2-bit quantization with a compressed KV cache that persists to disk. The model reportedly generates thinking tokens in proportion to the task, at roughly one-fifth the reasoning overhead of comparable models, which makes on-device chain-of-thought practical. Engineers running long-context local inference (up to 1M tokens) on Apple Silicon should find it worth evaluating. Unlike llama.cpp's generalist GGUF approach, ds4.c supports a single model, is officially logit-validated, and is Metal-native.
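As a rough sanity check on the memory claim, the weight footprint can be estimated from the parameter count and bit width alone. The sketch below is back-of-the-envelope arithmetic, not code from ds4.c: the helper name `weight_gigabytes` is hypothetical, and it assumes every parameter is stored at exactly the stated bit width, ignoring quantization scales, any tensors kept at higher precision, and the KV cache.

```c
#include <assert.h>

/* Hypothetical helper (not part of ds4.c): approximate weight storage in
 * gigabytes (1 GB = 1e9 bytes) for n_params parameters stored at
 * bits_per_weight bits each.
 * Assumption: uniform bit width, no scale/metadata overhead, no KV cache. */
static double weight_gigabytes(double n_params, double bits_per_weight) {
    assert(n_params >= 0.0 && bits_per_weight > 0.0);
    double bytes = n_params * bits_per_weight / 8.0; /* 8 bits per byte */
    return bytes / 1e9;
}
```

Under these assumptions, `weight_gigabytes(284e9, 2.0)` gives 71 GB, which fits in 128GB with room left for the KV cache and the OS, while `weight_gigabytes(284e9, 16.0)` gives 568 GB, far beyond any MacBook. Real quantized formats carry extra per-block metadata, so the actual file is likely somewhat larger than this lower bound.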