[arXiv] score: 0.24

FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

May 7, 2026

Researchers introduce FASQ, a calibration-free product quantization framework for LLM weight compression. It compresses models to 27-49% of their original FP16 size by tuning sub-vector size and codebook cardinality rather than selecting a fixed bit-width. On Meta-Llama-3-8B, FASQ scores 67.1-67.7 average accuracy at 37-42% of the FP16 model size, outperforming 4-bit GPTQ and AWQ without requiring calibration data. Custom CUDA kernels, including a LUT-free direct-compute GEMV, make inference practical and address the long-standing deployment bottleneck of product quantization on commodity GPUs. Edge and on-device ML engineers whose memory budgets fall between the standard 4-bit and 8-bit quantization points may want to evaluate FASQ as a drop-in compression alternative.
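To make the two tuning knobs concrete, here is a minimal sketch of generic product quantization in NumPy. It is not FASQ's algorithm or kernel design, which the summary does not describe in detail. The function names (`compressed_fraction`, `pq_compress`, `pq_decompress`), the single shared codebook, the FP16 codebook storage, and the plain k-means fit are all assumptions made for illustration.

```python
import numpy as np

def compressed_fraction(rows, cols, sub_dim, num_codes, codebook_bits=16):
    """Storage as a fraction of FP16 for a hypothetical single-codebook PQ layout."""
    num_subvectors = rows * cols // sub_dim
    index_bits = int(np.ceil(np.log2(num_codes)))          # bits per stored index
    index_storage = num_subvectors * index_bits
    codebook_storage = num_codes * sub_dim * codebook_bits  # assumed FP16 centroids
    return (index_storage + codebook_storage) / (rows * cols * 16)

def pq_compress(weights, sub_dim, num_codes, iters=10, seed=0):
    """Fit one codebook with plain k-means; no calibration data is involved."""
    rng = np.random.default_rng(seed)
    subvecs = weights.reshape(-1, sub_dim).astype(np.float32)
    # Initialize centroids from randomly chosen sub-vectors.
    start = rng.choice(len(subvecs), size=num_codes, replace=False)
    codebook = subvecs[start].copy()
    for _ in range(iters):
        # Squared distance from every sub-vector to every centroid.
        dists = ((subvecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        codes = dists.argmin(axis=1)
        for k in range(num_codes):
            members = subvecs[codes == k]
            if len(members):                 # leave empty clusters unchanged
                codebook[k] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((subvecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1), codebook

def pq_decompress(codes, codebook, shape):
    """Rebuild the weight matrix by looking up each index in the codebook."""
    return codebook[codes].reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64)).astype(np.float32)
    codes, codebook = pq_compress(w, sub_dim=4, num_codes=16)
    w_hat = pq_decompress(codes, codebook, w.shape)
    print("reconstruction MSE:", float(np.mean((w - w_hat) ** 2)))
    print("size fraction:", compressed_fraction(4096, 4096, sub_dim=2, num_codes=256))
```

In this layout the index cost is roughly log2(num_codes) / sub_dim bits per weight, so changing either parameter moves the size in finer steps than switching between whole bit-widths. The configurations FASQ actually uses to reach 27-49% are not given in this summary, so the values above are placeholders.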

cs.LG, cs.AI, cs.AR

SOURCE

https://arxiv.org/abs/2605.04084
