[arXiv] score: 0.77

Copy-as-Decode Speeds LLM Text Editing 6.8×–303× via Parallel Prefill

May 26, 2026

A grammar-constrained decoding mechanism replaces autoregressive regeneration of unchanged tokens during LLM editing with parallel-prefill KV cache updates, achieving 6.8×–303× kernel speedup on Qwen2.5-1.5B and 7B for copy spans of 8–N tokens.
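To make the idea concrete, here is a minimal sketch of why copying can be cheaper than regenerating. It is a toy cost model, not the paper's grammar or kernel: the names `Copy`, `Gen`, and `apply_edit` are hypothetical. It assumes a copied span costs one parallel forward pass and each newly generated token costs one sequential pass, and it ignores the cost of emitting the copy instruction itself.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Copy:
    """Hypothetical primitive: reuse source[start:start + length] unchanged."""
    start: int
    length: int


@dataclass
class Gen:
    """Hypothetical primitive: emit newly generated tokens one at a time."""
    tokens: List[str]


Op = Union[Copy, Gen]


def apply_edit(source: List[str], ops: List[Op]) -> Tuple[List[str], int]:
    """Apply an edit program and count forward passes under the toy model."""
    out: List[str] = []
    passes = 0
    for op in ops:
        if isinstance(op, Copy):
            end = op.start + op.length
            if op.start < 0 or op.length <= 0 or end > len(source):
                raise ValueError(f"copy span out of range: {op}")
            out.extend(source[op.start:end])
            passes += 1  # assumption: the whole span is processed in one parallel pass
        else:
            out.extend(op.tokens)
            passes += len(op.tokens)  # assumption: one sequential pass per new token
    return out, passes


# Toy edit: change "a - b" to "a + b" while copying everything else.
source = "def add ( a , b ) : return a - b".split()
ops: List[Op] = [Copy(0, 10), Gen(["+"]), Copy(11, 1)]
edited, passes = apply_edit(source, ops)
baseline_passes = len(edited)  # full regeneration: one pass per output token

print(" ".join(edited))
print(f"passes with copy: {passes}, full regeneration: {baseline_passes}")
```

The pass counts here only illustrate the shape of the saving. They are not the 6.8×–303× kernel speedups reported in the summary, which depend on the model and span length.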

cs.CL cs.AI

HOW THIS AFFECTS YOU

● builder: You can apply this decoding-layer technique to dramatically reduce latency and compute cost for code and text editing workloads without end-to-end retraining.

● researcher: The structured two-primitive grammar with token-level FSM enforcement and parallel-prefill acceptance offers a training-free alternative to speculative decoding for edit-heavy inference tasks.

SOURCE

https://arxiv.org/abs/2604.18170
