[r/LocalLLaMA] score: 0.25
NVIDIA Quantizes Qwen3.6-35B-A3B to NVFP4, Cuts Memory 3x
May 30, 2026
NVIDIA's NVFP4 quantization of Qwen3.6-35B-A3B reduces GPU memory and disk footprint by 3.06x versus BF16. The weights and activations of the linear layers in the MoE transformer blocks are quantized from 16 bits to 4 bits. The checkpoint was produced with NVIDIA Model Optimizer and is ready for inference with vLLM.
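The 3.06x figure is lower than the 4x that a plain 16-to-4-bit conversion would suggest, and a rough back-of-the-envelope calculation shows why. The sketch below is an estimate under stated assumptions, not NVIDIA's accounting: it takes the parameter count from the "35B" in the model name, uses the NVFP4 layout of 4-bit values plus one 8-bit scale per 16-element block (the small per-tensor scale is ignored), and assumes that everything not quantized stays in BF16.

```python
BF16_BITS = 16.0
REPORTED_RATIO = 3.06       # footprint reduction stated in the announcement
ASSUMED_PARAMS_B = 35.0     # assumption: total parameters, read off the model name

# NVFP4 stores 4-bit values plus one 8-bit (FP8) scale per block of 16 values.
NVFP4_BITS = 4.0 + 8.0 / 16.0   # 4.5 effective bits per quantized value


def footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8.0


bf16_gb = footprint_gb(ASSUMED_PARAMS_B, BF16_BITS)   # about 70 GB
nvfp4_gb = bf16_gb / REPORTED_RATIO                   # about 22.9 GB

# Best case if every parameter were stored in NVFP4.
ideal_ratio = BF16_BITS / NVFP4_BITS                  # about 3.56x

# Solve f*16 + (1 - f)*4.5 = 16/3.06 for f, the share of parameters
# that would have to remain in BF16 to explain the reported ratio.
avg_bits = BF16_BITS / REPORTED_RATIO
bf16_fraction = (avg_bits - NVFP4_BITS) / (BF16_BITS - NVFP4_BITS)  # about 6.3%

print(f"BF16 footprint:  {bf16_gb:.1f} GB")
print(f"NVFP4 footprint: {nvfp4_gb:.1f} GB")
print(f"Ideal all-NVFP4 ratio: {ideal_ratio:.2f}x")
print(f"Implied share left in BF16: {bf16_fraction:.1%}")
```

Under these assumptions, roughly 6% of the parameters staying in 16-bit precision is enough to pull the ratio from about 3.56x down to 3.06x, which is consistent with only the MoE linear layers being quantized.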
HOW THIS AFFECTS YOU
● builder: You can deploy Qwen3.6-35B-A3B in roughly one third of the GPU memory that the BF16 model needs, using vLLM today with this drop-in NVFP4 checkpoint.
● researcher: Worth watching for accuracy-retention data on the MMLU Pro, GPQA Diamond, AIME 2025, and SciCode benchmarks at 4-bit MoE quantization.
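For the builder case, loading the checkpoint through vLLM's offline Python API might look like the sketch below. The repository ID is hypothetical, since the announcement does not name the checkpoint, and the engine settings (`tensor_parallel_size`, `max_model_len`) are illustrative values rather than recommendations. The sketch relies on vLLM reading the quantization format from the checkpoint's own config, which is how a drop-in checkpoint would be expected to behave.

```python
# Hypothetical repository ID: the announcement does not give the checkpoint's name.
MODEL_ID = "nvidia/Qwen3.6-35B-A3B-NVFP4"

# Illustrative engine settings (assumptions, tune for your hardware).
ENGINE_KWARGS = {
    "model": MODEL_ID,
    "tensor_parallel_size": 1,   # number of GPUs to shard the model across
    "max_model_len": 8192,       # cap context length to limit KV-cache memory
}


def generate(prompt: str, max_tokens: int = 128) -> str:
    """Load the checkpoint with vLLM and return one completion for `prompt`."""
    # Imported lazily so this module can be inspected on a machine without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(**ENGINE_KWARGS)
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    # One prompt in, one request out; take the first sampled completion.
    return outputs[0].outputs[0].text
```

Calling `generate("Summarize NVFP4 in one sentence.")` on a machine with vLLM installed and a supported GPU would download the checkpoint and run a single completion. For serving rather than offline use, the same model ID would be passed to vLLM's OpenAI-compatible server instead.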