[arXiv] score: 0.77
Copy-as-Decode Speeds LLM Text Editing 6.8×–303× via Parallel Prefill
May 26, 2026
A grammar-constrained decoding mechanism replaces autoregressive regeneration of unchanged tokens during LLM editing with parallel-prefill KV cache updates, achieving 6.8×–303× kernel speedup on Qwen2.5-1.5B and 7B for copy spans of 8–N tokens.
cs.CL, cs.AI
HOW THIS AFFECTS YOU
● builder: You can apply this decoding-layer technique to dramatically reduce latency and compute cost for code and text editing workloads without end-to-end retraining.
● researcher: The structured two-primitive grammar with token-level FSM enforcement and parallel-prefill acceptance offers a training-free alternative to speculative decoding for edit-heavy inference tasks.
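
To make the copy-versus-regenerate idea concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the summary says only that the grammar has two primitives, so the names `Copy` and `Gen`, the function `apply_edit`, and the cost model (one forward pass per copied span, one per generated token) are all hypothetical. The token-level FSM enforcement and the actual KV cache update are not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Copy:
    """Hypothetical primitive: reuse source tokens [start, end) unchanged."""
    start: int
    end: int


@dataclass
class Gen:
    """Hypothetical primitive: emit newly decoded tokens."""
    tokens: List[str]


Op = Union[Copy, Gen]


def apply_edit(source: List[str], ops: List[Op]) -> Tuple[List[str], int]:
    """Apply an edit program and count forward passes under a toy cost model.

    Assumption: a copied span costs one pass (its tokens are already known,
    so they can be processed in a single parallel prefill), while each
    generated token costs one sequential decode pass.
    """
    output: List[str] = []
    passes = 0
    for op in ops:
        if isinstance(op, Copy):
            if not (0 <= op.start < op.end <= len(source)):
                raise ValueError(f"copy span out of range: {op}")
            output.extend(source[op.start:op.end])
            passes += 1
        else:
            output.extend(op.tokens)
            passes += len(op.tokens)
    return output, passes


source = "def add ( a , b ) : return a + b".split()
# Change the operator from "+" to "-", copying everything else.
ops = [Copy(0, 10), Gen(["-"]), Copy(11, 12)]
edited, passes = apply_edit(source, ops)
baseline_passes = len(edited)  # full autoregressive regeneration
```

In this toy case the edit program needs 3 passes where regenerating all 12 tokens would need 12. The count is only a proxy for sequential steps; it says nothing about the measured kernel speedups reported in the summary.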