[HN] score: 0.39
KVarN Delivers 3–5x KV-Cache Capacity With No Throughput Loss on vLLM
June 4, 2026
KVarN is a calibration-free KV-cache quantization backend for vLLM from Huawei. It provides 3–5x more cache capacity and up to 1.3x the throughput of FP16, whereas TurboQuant gives up 40–52% of throughput for similar capacity gains. In tests on Qwen3-32B at 16K context with TP=2, it matches FP16 accuracy, and it is enabled with a single flag and no model changes.
HOW THIS AFFECTS YOU
● Builder: A drop-in vLLM flag enables 3–5x longer context or more concurrent requests at no accuracy or throughput cost, directly improving serving economics for long-context and agentic workloads.
● Researcher: The variance normalization approach breaks the typical capacity-throughput tradeoff in KV quantization; the Qwen3-32B AIME25 results provide a concrete accuracy baseline to compare against.
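
To make the capacity claim concrete, here is a rough back-of-the-envelope sketch of how a 3–5x cache capacity gain could translate into concurrent 16K-context requests. It is not KVarN's or vLLM's actual accounting. The layer count, KV-head count, head dimension, and the 20 GiB cache budget are illustrative assumptions rather than figures from the announcement, and the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies at FP16 (2 bytes per element).

    The leading factor of 2 accounts for storing both a key and a value
    vector per layer and per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(cache_budget_bytes, bytes_per_token, capacity_gain=1.0):
    """Tokens that fit in the cache budget, given a capacity multiplier."""
    return int(cache_budget_bytes * capacity_gain // bytes_per_token)


def concurrent_sequences(total_tokens, context_len):
    """Full-length sequences that can be resident in the cache at once."""
    return total_tokens // context_len


# Assumed model shape (illustrative, not taken from the announcement).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 64, 8, 128
# Assumed aggregate cache budget across all GPUs (hypothetical).
CACHE_BUDGET = 20 * 2**30
CONTEXT_LEN = 16 * 1024  # 16K context, as in the reported test setup

per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

# Gains of 3x and 5x bracket the reported 3-5x range; 1x is the FP16 baseline.
for gain in (1.0, 3.0, 5.0):
    tokens = tokens_that_fit(CACHE_BUDGET, per_token, gain)
    seqs = concurrent_sequences(tokens, CONTEXT_LEN)
    print(f"{gain:.0f}x capacity: {tokens} tokens, {seqs} sequences at 16K")
```

Under these assumptions the FP16 baseline holds 5 full 16K sequences, and a 3x or 5x gain raises that to 15 or 25. Real deployments will differ, since usable cache memory depends on model weights, activation memory, and how the serving engine pages its blocks.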