[arXiv] score: 0.36

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

May 15, 2026

SP-KV introduces a lightweight per-token utility predictor that prunes KV cache entries by forecasting their future attention utility, targeting the memory and bandwidth bottlenecks of long-context transformer inference. It operates at fine (per-token) granularity and does not require an architectural overhaul. It is most relevant to practitioners running agentic or extended-context workloads where the KV cache dominates GPU memory.
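To make the idea of gating KV writes on predicted utility concrete, here is a minimal sketch. It is not the paper's implementation: the linear-plus-sigmoid predictor, the fixed threshold, and the names `PrunedKVCache`, `utility`, and `maybe_write` are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass, field


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class PrunedKVCache:
    """Toy KV cache that stores an entry only if its predicted utility is high.

    Hypothetical illustration: the predictor here is a single linear layer
    followed by a sigmoid, which is an assumption, not the SP-KV design.
    """
    weights: list            # assumed predictor weights, one per hidden dim
    bias: float = 0.0
    threshold: float = 0.5   # assumed cutoff on predicted utility
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def utility(self, hidden: list) -> float:
        # Predicted probability that this token will be attended to later.
        z = sum(w * h for w, h in zip(self.weights, hidden)) + self.bias
        return sigmoid(z)

    def maybe_write(self, hidden: list, key, value) -> bool:
        # Write the (key, value) pair only when predicted utility clears
        # the threshold; otherwise skip it, saving cache memory.
        if self.utility(hidden) >= self.threshold:
            self.keys.append(key)
            self.values.append(value)
            return True
        return False


cache = PrunedKVCache(weights=[1.0, -1.0])
kept = cache.maybe_write([2.0, 0.0], "k0", "v0")     # utility = sigmoid(2)
dropped = cache.maybe_write([0.0, 2.0], "k1", "v1")  # utility = sigmoid(-2)
```

In this toy setup the first token is cached and the second is skipped, so the cache grows only with tokens the predictor expects to matter. A real system would presumably train the predictor against observed attention, which this sketch does not attempt.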

cs.LG, cs.CL

SOURCE

https://arxiv.org/abs/2605.14037
