Accelerating Gemma 4: faster inference with multi-token prediction drafters
May 5, 2026
Google released Multi-Token Prediction (MTP) speculative decoding drafters for the Gemma 4 family, delivering up to 3x inference speedup with no output quality degradation. The architecture pairs lightweight drafter models with Gemma 4 target models (up to 31B parameters), verifying drafted tokens in parallel to overcome memory-bandwidth bottlenecks. Validated across LiteRT-LM, MLX, Hugging Face Transformers, and vLLM, this matters most for edge and consumer-hardware deployments, where VRAM constraints throttle autoregressive throughput. Compared to standard speculative decoding, MTP drafters are purpose-trained on Gemma 4 output distributions, improving draft acceptance rates and making the speedup reliable in practice rather than workload-dependent.
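To make the draft-and-verify mechanism concrete, here is a minimal toy sketch of greedy speculative decoding. The "models" are stand-in deterministic functions, not Gemma 4 or an actual MTP drafter, and all names (`target_next`, `draft_next`, the chunk size `k`) are illustrative assumptions; the point is only to show why the output is guaranteed to match plain autoregressive decoding while the target model runs fewer sequential steps.

```python
# Toy sketch of draft-and-verify speculative decoding (greedy case).
# These "models" are deterministic stand-ins, not real LLMs.

def target_next(seq):
    """Stand-in for the large target model's greedy next token."""
    return (sum(seq) * 31 + len(seq)) % 100

def draft_next(seq):
    """Stand-in for the small drafter: agrees with the target most of the time."""
    t = target_next(seq)
    return t if t % 7 else (t + 1) % 100  # occasionally disagrees

def generate_autoregressive(prompt, n):
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq

def generate_speculative(prompt, n, k=4):
    """Drafter proposes k tokens; target verifies them in one parallel pass."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n:
        # 1) Drafter proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) Target scores all k positions at once (simulated sequentially here);
        #    accept the longest matching prefix, then substitute the target's
        #    token at the first mismatch. Output is identical to the baseline.
        accepted = []
        for i, tok in enumerate(draft):
            if target_next(seq + draft[:i]) == tok:
                accepted.append(tok)
            else:
                accepted.append(target_next(seq + draft[:i]))
                break
        seq.extend(accepted)
    return seq[:len(prompt) + n]

prompt = [3, 1, 4]
assert generate_speculative(prompt, 20) == generate_autoregressive(prompt, 20)
```

The speedup comes from step 2: verifying k drafted tokens costs one parallel forward pass of the target model instead of k sequential ones, which is exactly what relieves the memory-bandwidth bottleneck the article describes. A purpose-trained drafter raises the acceptance rate, so more of each k-token chunk survives verification.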