Transformers May Not Need Separate Q, K, V Projections
June 4, 2026
A systematic evaluation of QKV projection sharing, covering 300M and 1.2B parameter models trained on 10B tokens, finds that shared-projection variants perform on par with or better than standard three-projection attention on vision and language tasks. Variants such as Q=K=V reduce parameter count and compute in the attention layer without consistent accuracy loss. One caveat: when queries and keys share a projection, the attention maps become symmetric, which the study mitigates with 2D positional encodings.
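To make the idea concrete, here is a minimal single-head sketch in NumPy comparing a Q=K=V layer with standard three-projection attention. It is an illustration under stated assumptions, not the study's implementation: the function names, tensor sizes, and random weights are hypothetical, and multi-head splitting, the output projection, and positional encodings are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_qkv_attention(x, w):
    """Single-head attention where one matrix w yields Q, K and V (hypothetical helper)."""
    h = x @ w                                   # Q = K = V = h, shape (seq_len, d_head)
    scores = h @ h.T / np.sqrt(h.shape[-1])     # pre-softmax scores, symmetric by construction
    return softmax(scores) @ h, scores

def standard_attention(x, wq, wk, wv):
    """Standard single-head attention with three separate projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v, scores

# Illustrative sizes only; these are assumptions, not values from the study.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w = rng.normal(size=(d_model, d_head))
wq, wk, wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out_shared, scores_shared = shared_qkv_attention(x, w)
out_std, scores_std = standard_attention(x, wq, wk, wv)

# Projection parameters: one matrix versus three of the same shape.
shared_params = w.size
standard_params = wq.size + wk.size + wv.size
```

In this sketch the shared variant uses one third of the projection parameters, and its pre-softmax score matrix equals its own transpose, so position i scores position j exactly as j scores i. That is the symmetry the summary refers to; the row-wise softmax that follows does not restore any directional preference on its own.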
HOW THIS AFFECTS YOU
● Builder: Shared QKV projections could reduce attention-layer parameter count and inference cost, so they are worth evaluating if you are optimizing transformer-based models for production.
● Researcher: Worth watching because it challenges a foundational assumption of transformer design with controlled experiments at non-trivial scale across multiple modalities.