
KVarN Delivers 3–5x KV-Cache Capacity With No Throughput Loss on vLLM

June 4, 2026

KVarN is a calibration-free KV-cache quantization backend for vLLM from Huawei. It provides 3–5x more cache capacity and up to 1.3x the throughput of FP16, whereas TurboQuant gives up 40–52% of throughput for similar capacity gains. Tested on Qwen3-32B at 16K context with TP=2, it matches FP16 accuracy and is enabled via a single flag with no model changes.
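To get a feel for what a 3–5x capacity gain means at 16K context, the arithmetic can be sketched as below. This is a rough back-of-the-envelope estimate, not KVarN's own accounting. The model shape (64 layers, 8 KV heads, head dimension 128) is an assumed, illustrative configuration for a Qwen3-32B-class model and should be checked against the real model config. The 40 GiB cache budget is hypothetical, and the capacity gain is treated as a plain multiplier that ignores quantization metadata and tensor-parallel sharding.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def sequences_that_fit(budget_bytes, bytes_per_token, context_len, capacity_gain=1):
    """Full-length sequences that fit in the cache budget.

    capacity_gain is a simplified multiplier on effective capacity
    (1 for FP16, 3 to 5 for the range quoted above).
    """
    return (budget_bytes * capacity_gain) // (bytes_per_token * context_len)

# Assumed, illustrative shape for a Qwen3-32B-class model (verify against the config).
fp16_per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8,
                                    head_dim=128, bytes_per_elem=2)

context_len = 16 * 1024   # 16K context, as in the reported test setup
budget = 40 * GIB         # hypothetical KV-cache budget

for gain in (1, 3, 5):
    n = sequences_that_fit(budget, fp16_per_token, context_len, gain)
    print(f"capacity gain {gain}x: {n} concurrent 16K sequences")
```

Under these assumptions a single 16K sequence takes about 4 GiB of FP16 cache, so the same budget goes from 10 concurrent sequences to roughly 30–50.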

HOW THIS AFFECTS YOU

● Builder: A drop-in vLLM flag enables 3–5x longer context or more concurrent requests at no accuracy or throughput cost, directly improving serving economics for long-context and agentic workloads.

● Researcher: The variance normalization approach breaks the typical capacity-throughput tradeoff in KV quantization; the Qwen3-32B AIME25 results provide a concrete accuracy baseline to compare against.

SOURCE

https://github.com/huawei-csl/KVarN
