[arXiv]
FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
May 7, 2026
Researchers introduce FASQ, a calibration-free product quantization framework for LLM weight compression. Instead of choosing among fixed bit-widths, FASQ tunes sub-vector size and codebook cardinality, compressing models to 27-49% of their original FP16 size. On Meta-Llama-3-8B, FASQ reaches 67.1-67.7 average accuracy at 37-42% of FP16 model size, outperforming 4-bit GPTQ and AWQ without requiring calibration data. Custom CUDA kernels, including a LUT-free direct-compute GEMV, make inference practical, addressing the long-standing deployment bottleneck of product quantization on commodity GPUs. Edge and on-device ML engineers whose memory budgets fall between the standard 4-bit and 8-bit quantization points may find FASQ worth evaluating as a drop-in compression alternative.
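To make the two tuning knobs concrete, here is a minimal sketch of generic product quantization in plain Python. It is not FASQ's algorithm or its CUDA kernels: the function names, the single shared codebook per matrix, and the use of plain k-means fitted on the weights themselves (so no calibration data is involved) are all assumptions made for illustration. `sub_dim` stands for the sub-vector size and `codebook_size` for the codebook cardinality.

```python
import math
import random


def pq_size_fraction(rows, cols, sub_dim, codebook_size, fp_bits=16):
    """Size of a product-quantized rows x cols matrix relative to FP16.

    Assumption: one codebook per matrix, stored at fp_bits precision.
    """
    n_weights = rows * cols
    n_subvectors = n_weights // sub_dim
    index_bits = math.ceil(math.log2(codebook_size))  # bits per stored index
    code_bits = n_subvectors * index_bits
    codebook_bits = codebook_size * sub_dim * fp_bits
    return (code_bits + codebook_bits) / (n_weights * fp_bits)


def nearest(vec, codebook):
    """Index of the codeword with the smallest squared distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])),
    )


def pq_quantize(weights, sub_dim, codebook_size, iters=10, seed=0):
    """Fit a k-means codebook on the weights themselves and encode them."""
    assert len(weights) % sub_dim == 0
    subvecs = [tuple(weights[i:i + sub_dim])
               for i in range(0, len(weights), sub_dim)]
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(subvecs, codebook_size)]
    for _ in range(iters):
        codes = [nearest(v, codebook) for v in subvecs]
        for c in range(codebook_size):
            members = [v for v, a in zip(subvecs, codes) if a == c]
            if members:  # leave empty clusters unchanged
                codebook[c] = [sum(col) / len(members) for col in zip(*members)]
    codes = [nearest(v, codebook) for v in subvecs]  # final assignment
    return codes, codebook


def pq_dequantize(codes, codebook):
    """Rebuild the flat weight list by looking up each index."""
    return [x for c in codes for x in codebook[c]]


if __name__ == "__main__":
    rng = random.Random(1)
    w = [rng.gauss(0.0, 1.0) for _ in range(512)]
    codes, cb = pq_quantize(w, sub_dim=2, codebook_size=16)
    w_hat = pq_dequantize(codes, cb)
    mse = sum((a - b) ** 2 for a, b in zip(w, w_hat)) / len(w)
    print(f"reconstruction MSE: {mse:.4f}")
    print(f"size vs FP16: {pq_size_fraction(4096, 4096, 2, 4096):.4f}")
```

Under these assumptions, each weight costs roughly `log2(codebook_size) / sub_dim` bits, so a hypothetical 4096 x 4096 matrix with `sub_dim=2` and `codebook_size=4096` comes to about 37.5% of its FP16 size. Varying the two parameters reaches size points that fixed 4-bit or 8-bit schemes cannot express; the configurations FASQ actually uses are not stated in this summary.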
cs.LG, cs.AI, cs.AR