[r/LocalLLaMA] score: 0.18
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
May 6, 2026
ParoQuant is a new 4-bit linear quantization method that uses pairwise rotation transforms to suppress activation outliers, targeting reasoning LLMs, where quantization errors compound across thousands of generated tokens. On Qwen3-4B it scores 73.3 on AIME24, within 2.3 points of the FP16 baseline (75.6), versus 62.2 for AWQ and a catastrophic 45.6 for EfficientQAT, while delivering a 2.1x inference speedup on an RTX A6000. The result directly challenges the assumption that weight fine-tuning (EfficientQAT's approach) is sufficient: it suggests that suppressing activation outliers via rotation, not tuning the weights, is the critical bottleneck for reasoning accuracy under quantization. Engineers deploying reasoning models on consumer or prosumer GPUs should evaluate this immediately; weights are available on HuggingFace.
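For intuition, here is a minimal numpy sketch of the pairwise-rotation idea: a single Givens rotation spreads an outlier input channel's energy across a channel pair before 4-bit quantization, and the inverse rotation is applied to activations at inference so the product W x is mathematically unchanged. The pair choice, the theta = pi/4 angle, and the `givens`/`quantize_4bit` helpers are illustrative assumptions, not ParoQuant's actual algorithm.

```python
import numpy as np

def givens(dim, i, j, theta):
    """Orthogonal rotation in the (i, j) coordinate plane (Givens rotation)."""
    R = np.eye(dim)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i] = R[j, j] = c
    R[i, j], R[j, i] = -s, s
    return R

def quantize_4bit(W):
    """Symmetric per-row 4-bit quantization (hypothetical quantizer). One huge
    input channel inflates every row's range, which is what rotation fixes."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0  # int4 grid [-8, 7]
    q = np.clip(np.round(W / scale), -8, 7)
    return q, scale

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))
W[:, 2] *= 20.0                  # inject an outlier input channel

# Rotate the outlier channel against a quiet partner; theta = pi/4 splits
# the outlier's energy evenly across the pair (illustrative choice).
R = givens(d, 2, 5, np.pi / 4)
W_rot = W @ R.T                  # done offline, before quantization

q, scale = quantize_4bit(W_rot)
W_deq = q * scale

x = rng.normal(size=d)
y_exact = W @ x
# Online: rotate activations, so (W R^T)(R x) = W x up to quantization error.
y_rot = W_deq @ (R @ x)

# Baseline: quantize the un-rotated weights directly.
q0, s0 = quantize_4bit(W)
y_plain = (q0 * s0) @ x

err = lambda y: np.linalg.norm(y - y_exact) / np.linalg.norm(y_exact)
print("relative error without rotation:", err(y_plain))
print("relative error with rotation:   ", err(y_rot))
```

A single pair rotation can only spread the outlier across two channels (a factor of roughly sqrt(2) in dynamic range); presumably the method chains many such pairwise rotations to flatten outliers further, which is what keeps the runtime overhead low relative to a full dense rotation.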
news