
DeepSeek 4 Flash local inference engine for Metal

May 7, 2026

ds4.c is a purpose-built Metal inference engine for DeepSeek V4 Flash, a 284B-parameter MoE model. It runs on MacBooks with 128GB of RAM by combining 2-bit quantization with a compressed KV cache that can be persisted to disk. The engine is built around the model's proportional thinking-token generation, which reportedly incurs about one-fifth the reasoning overhead of comparable models and makes on-device chain-of-thought practical. Engineers running long-context local inference (up to 1M tokens) on Apple Silicon should evaluate it. Unlike llama.cpp's general-purpose GGUF approach, ds4.c supports a single model, is Metal-native, and validates its output against the official model's logits.
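The following back-of-envelope Python sketch shows why 2-bit quantization is what lets a 284B-parameter model fit in 128GB. It assumes every weight is stored at one uniform bit width and ignores per-block scales, activations, and the KV cache. The 4-level absmax quantizer and the `quantize_2bit`/`pack_2bit` helpers are hypothetical illustrations of generic 2-bit packing, not ds4.c's actual weight format.

```python
"""Back-of-envelope memory budget and a generic 2-bit packing sketch.

Assumptions (not taken from ds4.c): all weights use one uniform bit width,
per-block scale overhead is ignored in the budget, and the 4-level absmax
quantizer below is a hypothetical illustration, not ds4.c's real format.
"""

PARAMS = 284e9      # total parameter count quoted for DeepSeek V4 Flash
GIB = 1024 ** 3     # bytes per GiB


def weight_bytes(params: float, bits: float) -> float:
    """Bytes needed to store `params` weights at `bits` bits each."""
    return params * bits / 8


def quantize_2bit(block: list[float]) -> tuple[float, list[int]]:
    """Map each value to one of 4 levels: scale * {-1.5, -0.5, 0.5, 1.5}."""
    amax = max(abs(x) for x in block) or 1.0  # avoid a zero scale
    scale = amax / 1.5
    # Shift to [0, 3], round to the nearest level, and clamp to 2 bits.
    codes = [min(3, max(0, round(x / scale + 1.5))) for x in block]
    return scale, codes


def dequantize_2bit(scale: float, codes: list[int]) -> list[float]:
    """Invert quantize_2bit up to rounding error."""
    return [(c - 1.5) * scale for c in codes]


def pack_2bit(codes: list[int]) -> bytes:
    """Pack four 2-bit codes per byte, first code in the lowest bits."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= (c & 0b11) << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, n: int) -> list[int]:
    """Recover the first `n` 2-bit codes from packed bytes."""
    return [(data[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


if __name__ == "__main__":
    for bits in (16, 8, 4, 2):
        gib = weight_bytes(PARAMS, bits) / GIB
        print(f"{bits:>2}-bit weights: {gib:6.1f} GiB")
```

At 4 bits, the weights alone would need roughly 132 GiB, which exceeds the machine's RAM. At 2 bits they need roughly 66 GiB, which leaves headroom for the KV cache and the OS. Real 2-bit formats also store a scale per block of weights, which adds a fraction of a bit per weight to this estimate.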

SOURCE

https://github.com/antirez/ds4
