[arXiv] score: 0.36
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
May 15, 2026
SP-KV introduces a lightweight per-token utility predictor that prunes KV cache entries by forecasting how useful each entry will be to future attention, targeting the memory and bandwidth bottlenecks of long-context transformer inference. It operates at per-token granularity and does not require an architectural overhaul. It is most relevant to practitioners running agentic or extended-context workloads, where the KV cache dominates GPU memory.
cs.LG, cs.CL
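
The write-gating idea in the summary can be illustrated with a toy sketch: score each token's predicted future utility and store its key/value pair only when the score clears a threshold. The summary does not specify the predictor's form, its training, or the decision rule, so everything below is an assumption made for illustration: the linear-plus-sigmoid predictor, the fixed threshold of 0.5, the hand-picked weights, and the names `GatedKVCache`, `utility`, and `maybe_write` are hypothetical and not taken from SP-KV.

```python
import math
from dataclasses import dataclass, field


@dataclass
class GatedKVCache:
    """Toy KV cache that stores an entry only when predicted utility is high.

    Hypothetical throughout: the predictor, threshold, and names are
    illustrative assumptions, not the SP-KV method.
    """
    weights: list[float]        # predictor weights (assumed learned elsewhere)
    bias: float = 0.0
    threshold: float = 0.5      # assumed write threshold
    keys: list[list[float]] = field(default_factory=list)
    values: list[list[float]] = field(default_factory=list)

    def utility(self, hidden: list[float]) -> float:
        """Predicted future utility in (0, 1) from the token's hidden state."""
        logit = sum(w * h for w, h in zip(self.weights, hidden)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))

    def maybe_write(self, hidden, key, value) -> bool:
        """Append (key, value) only if predicted utility clears the threshold."""
        if self.utility(hidden) >= self.threshold:
            self.keys.append(key)
            self.values.append(value)
            return True
        return False


# Four toy tokens; each hidden state doubles as its key and value for brevity.
cache = GatedKVCache(weights=[1.0, -1.0])
hiddens = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.5], [-1.0, 1.0]]
written = [cache.maybe_write(h, key=h, value=h) for h in hiddens]
kept_fraction = len(cache.keys) / len(hiddens)
```

In this toy run, two of the four tokens are written, so the cache holds half as many entries as an ungated one would. A real system would learn the predictor jointly with or alongside the model and would need attention kernels that tolerate a sparse cache; neither is modeled here.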