[r/LocalLLaMA]score: 0.29

NVIDIA Nemotron Two-Tower 30B Hits 2.42x Throughput at 98.7% Quality

June 25, 2026

NVIDIA's Nemotron-TwoTower-30B-A3B uses a frozen autoregressive context tower paired with a diffusion denoiser that fills token blocks in parallel rather than sequentially. On the 30B-A3B backbone, the mask-diffusion approach achieves 2.42x wall-clock generation throughput while retaining 98.7% of autoregressive benchmark quality.

HOW THIS AFFECTS YOU

●

builderYou can swap in this model where throughput is the bottleneck and accept a 1.3% quality trade-off, potentially cutting inference costs significantly.

●

researcherThe two-tower hybrid architecture offers a concrete data point on diffusion LM viability at 30B scale with quantified quality-throughput trade-offs.

read original ↗huggingface.co

← back to feed