
DiffusionGemma 26B MoE Delivers 4x Faster Inference via Block-Level Text Generation

June 10, 2026

DiffusionGemma is a 26B Mixture-of-Experts model, released under Apache 2.0, that generates blocks of text simultaneously rather than token by token, achieving up to 4x faster inference on dedicated GPUs. Built on the Gemma 4 architecture with diffusion-based generation, it targets speed-critical and interactive local workflows.
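To make the block-versus-token distinction concrete, here is a rough sketch that counts sequential forward passes under each scheme. It is a toy model, not DiffusionGemma's actual decoding loop: the function names, the block size, and the number of refinement steps per block are all hypothetical values chosen for illustration, and none of them come from the announcement. Sequential pass count is also only a proxy for latency, since a pass over a whole block may cost more than a single-token pass.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Token-by-token decoding: one sequential forward pass per token."""
    return num_tokens


def block_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Block-level decoding (assumed model): every token in a block is
    refined together, and each block takes a fixed number of passes."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


def pass_ratio(num_tokens: int, block_size: int, steps_per_block: int) -> float:
    """How many times fewer sequential passes the block scheme needs."""
    return autoregressive_passes(num_tokens) / block_passes(
        num_tokens, block_size, steps_per_block
    )


# Hypothetical settings, NOT the model's real configuration.
print(pass_ratio(num_tokens=240, block_size=32, steps_per_block=10))
```

Under this toy model the saving grows with block size and shrinks as more refinement steps are needed per block, which is why a real-world figure such as "up to 4x" depends on hardware and decoding settings.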

HOW THIS AFFECTS YOU

● Builder: You can deploy this Apache 2.0 model today for latency-sensitive applications where autoregressive generation has been a bottleneck.

● Researcher: The MoE-plus-diffusion combination is worth examining: block-level generation at this scale, with a claimed speedup of up to 4x, is a concrete data point on non-autoregressive scaling.

● Founder: An up-to-4x inference speedup on an open-weight 26B model changes the cost and UX calculus for real-time AI features that were previously too slow to ship.

Read original: blog.google