[r/LocalLLaMA] score: 0.18
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
May 6, 2026
ParoQuant is a new 4-bit linear quantization method that uses pairwise rotation transforms to suppress activation outliers, targeting reasoning LLMs, where quantization errors compound across thousands of generated tokens. On Qwen3-4B it scores 73.3 on AIME24, within 2.3 points of the FP16 baseline (75.6), versus 62.2 for AWQ and a catastrophic 45.6 for EfficientQAT, while delivering a 2.1x inference speedup on an RTX A6000. The result directly challenges the assumption that weight fine-tuning (EfficientQAT's approach) is sufficient: it suggests that suppressing activation outliers via rotation, not tuning the weights, is the critical bottleneck for reasoning accuracy under quantization. Engineers deploying reasoning models on consumer or prosumer GPUs should evaluate this immediately; weights are available on HuggingFace.
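For intuition, here is a minimal numpy sketch of the pairwise-rotation idea: a single Givens rotation spreads an outlier input channel's energy across a channel pair before 4-bit quantization, and the inverse rotation is applied to activations at inference so the product W x is mathematically unchanged. The pair choice, the theta = pi/4 angle, and the `givens`/`quantize_4bit` helpers are illustrative assumptions, not ParoQuant's actual algorithm.

```python
import numpy as np

def givens(dim, i, j, theta):
    """Orthogonal rotation in the (i, j) coordinate plane (Givens rotation)."""
    R = np.eye(dim)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i] = R[j, j] = c
    R[i, j], R[j, i] = -s, s
    return R

def quantize_4bit(W):
    """Symmetric per-row 4-bit quantization (hypothetical quantizer). One huge
    input channel inflates every row's range, which is what rotation fixes."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0  # int4 grid [-8, 7]
    q = np.clip(np.round(W / scale), -8, 7)
    return q, scale

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))
W[:, 2] *= 20.0                  # inject an outlier input channel

# Rotate the outlier channel against a quiet partner; theta = pi/4 splits
# the outlier's energy evenly across the pair (illustrative choice).
R = givens(d, 2, 5, np.pi / 4)
W_rot = W @ R.T                  # done offline, before quantization

q, scale = quantize_4bit(W_rot)
W_deq = q * scale

x = rng.normal(size=d)
y_exact = W @ x
# Online: rotate activations, so (W R^T)(R x) = W x up to quantization error.
y_rot = W_deq @ (R @ x)

# Baseline: quantize the un-rotated weights directly.
q0, s0 = quantize_4bit(W)
y_plain = (q0 * s0) @ x

err = lambda y: np.linalg.norm(y - y_exact) / np.linalg.norm(y_exact)
print("relative error without rotation:", err(y_plain))
print("relative error with rotation:   ", err(y_rot))
```

A single pair rotation can only spread the outlier across two channels (a factor of roughly sqrt(2) in dynamic range); presumably the method chains many such pairwise rotations to flatten outliers further, which is what keeps the runtime overhead low relative to a full dense rotation.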
news