[r/LocalLLaMA] score: 0.17

Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding - Google Developers Blog

May 5, 2026
Google's TPU inference team has demonstrated 3X throughput gains using diffusion-style speculative decoding, in which a lightweight draft model generates multiple candidate tokens simultaneously rather than sequentially. Unlike traditional autoregressive speculative decoding, the diffusion approach better exploits TPU systolic-array parallelism, reducing memory-bandwidth bottlenecks at batch-inference scale. ML engineers deploying Gemini or JAX-based models on Cloud TPU v4/v5 pods should prioritize testing this pipeline. This directly challenges NVIDIA's TensorRT-LLM speculative-decoding benchmarks on H100s, making TPU cost-per-token economics significantly more competitive.
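The blog describes a diffusion-style variant; the core idea it builds on is the speculative-decoding verification step, where the target model scores all draft positions in one parallel forward pass and accepts the longest prefix it agrees with. As a minimal sketch of that greedy-verification step (not Google's implementation; `greedy_verify` and the toy logits are hypothetical), assuming a draft model has already proposed candidate tokens:

```python
import numpy as np

def greedy_verify(draft_tokens, target_logits):
    """Accept the longest prefix of draft tokens that the target model
    would itself have produced greedily.

    target_logits[i] holds the target's logits at draft position i,
    computed for all positions in a single parallel forward pass --
    the parallelism that speculative decoding exploits.
    """
    accepted = []
    for tok, logits in zip(draft_tokens, target_logits):
        target_tok = int(np.argmax(logits))
        if target_tok == tok:
            accepted.append(tok)
        else:
            # First mismatch: substitute the target's own token and stop;
            # decoding resumes from here on the next iteration.
            accepted.append(target_tok)
            break
    return accepted

# Toy example: vocabulary of 5 tokens, 3 draft tokens proposed.
draft = [2, 4, 1]
logits = np.full((3, 5), -1.0)
logits[0, 2] = 1.0  # target agrees at position 0
logits[1, 4] = 1.0  # target agrees at position 1
logits[2, 3] = 1.0  # target disagrees: it prefers token 3
print(greedy_verify(draft, logits))  # → [2, 4, 3]
```

Each verification pass thus yields between one and k+1 tokens for one target-model forward pass; the diffusion-style twist is in how the draft tokens are generated in parallel, not in this acceptance rule.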
news