[r/LocalLLaMA] score: 0.21

DeepSeek V4 Flash Runs Locally via Early llama.cpp PR, 3-bit Quant Promising

June 6, 2026

An in-progress llama.cpp PR (#24162) adds DeepSeek V4 Flash support. Early testers report frontier-comparable quality at a size suited to local inference, along with a custom 3-bit quantization that preserves the tensor layout. Throughput is currently 5–6 tokens per second, with no GPU or FlashAttention support yet, so the PR is experiment-only for now.
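To put the reported 5–6 tokens-per-second throughput in practical terms, here is a minimal sketch that converts a response length into wall-clock time. It assumes a steady decode rate and ignores prompt processing. The function names and example lengths are hypothetical, chosen only for illustration.

```python
# Rough wall-clock estimate for generation at the reported decode rate.
# Assumption: the rate is steady and prompt processing time is ignored.

REPORTED_TPS_LOW = 5.0   # lower end of the reported 5-6 tokens/s range
REPORTED_TPS_HIGH = 6.0  # upper end of the reported range


def generation_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def time_range_seconds(num_tokens: int) -> tuple[float, float]:
    """Return (fastest, slowest) seconds across the reported rate range."""
    fastest = generation_time_seconds(num_tokens, REPORTED_TPS_HIGH)
    slowest = generation_time_seconds(num_tokens, REPORTED_TPS_LOW)
    return fastest, slowest


if __name__ == "__main__":
    # Hypothetical response lengths, in tokens.
    for n in (150, 600, 3000):
        fastest, slowest = time_range_seconds(n)
        print(f"{n} tokens: {fastest:.0f} to {slowest:.0f} seconds")
```

Under these assumptions, a 600-token reply would take roughly 100 to 120 seconds, which is why the current build is better suited to experiments than to interactive use.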

HOW THIS AFFECTS YOU

● Builder: You can begin experimenting with local DeepSeek V4 Flash inference today via the PR, but expect instability and slow throughput until GPU support lands.

● Researcher: Early quantization results suggest V4 Flash is unusually robust to quantization for its size class, which is worth tracking as the PR matures.
