Accelerating Gemma 4: faster inference with multi-token prediction drafters
May 5, 2026
Google released Multi-Token Prediction (MTP) speculative decoding drafters for the Gemma 4 family, delivering up to 3x inference speedup with no output quality degradation. The architecture pairs lightweight drafter models with Gemma 4 target models (up to 31B parameters), verifying drafted tokens in parallel to overcome memory-bandwidth bottlenecks. Validated across LiteRT-LM, MLX, Hugging Face Transformers, and vLLM, this matters most for edge and consumer-hardware deployments, where VRAM constraints throttle autoregressive throughput. Compared to standard speculative decoding, MTP drafters are purpose-trained on Gemma 4 output distributions, improving draft acceptance rates and making the speedup reliable in practice rather than workload-dependent.
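To make the draft-and-verify mechanism concrete, here is a minimal toy sketch of greedy speculative decoding. The "models" are stand-in deterministic functions, not Gemma 4 or an actual MTP drafter, and all names (`target_next`, `draft_next`, the chunk size `k`) are illustrative assumptions; the point is only to show why the output is guaranteed to match plain autoregressive decoding while the target model runs fewer sequential steps.

```python
# Toy sketch of draft-and-verify speculative decoding (greedy case).
# These "models" are deterministic stand-ins, not real LLMs.

def target_next(seq):
    """Stand-in for the large target model's greedy next token."""
    return (sum(seq) * 31 + len(seq)) % 100

def draft_next(seq):
    """Stand-in for the small drafter: agrees with the target most of the time."""
    t = target_next(seq)
    return t if t % 7 else (t + 1) % 100  # occasionally disagrees

def generate_autoregressive(prompt, n):
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq

def generate_speculative(prompt, n, k=4):
    """Drafter proposes k tokens; target verifies them in one parallel pass."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n:
        # 1) Drafter proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) Target scores all k positions at once (simulated sequentially here);
        #    accept the longest matching prefix, then substitute the target's
        #    token at the first mismatch. Output is identical to the baseline.
        accepted = []
        for i, tok in enumerate(draft):
            if target_next(seq + draft[:i]) == tok:
                accepted.append(tok)
            else:
                accepted.append(target_next(seq + draft[:i]))
                break
        seq.extend(accepted)
    return seq[:len(prompt) + n]

prompt = [3, 1, 4]
assert generate_speculative(prompt, 20) == generate_autoregressive(prompt, 20)
```

The speedup comes from step 2: verifying k drafted tokens costs one parallel forward pass of the target model instead of k sequential ones, which is exactly what relieves the memory-bandwidth bottleneck the article describes. A purpose-trained drafter raises the acceptance rate, so more of each k-token chunk survives verification.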