[arXiv]score: 0.13
End-to-End Dynamic Sparsity for Resource-Adaptive LLM Inference
June 29, 2026
Researchers formulate inference as a constrained allocation problem conditioned on both input and runtime resource budget, proposing Learning to Allocate (L2A), an end-to-end framework for resource-adaptive LLM inference. The approach integrates lightweight, budget-conditioned and input-aware gating networks into LLMs, achieving 1.4x speedup on a 4x TPUv3 cluster and 2.5x reduction in peak memory usage on a 1.5x TPUv3 cluster compared to a static model.