[HUGGINGFACE] score: 0.93

Nemotron 3 Ultra: 550B MoE Hybrid Mamba-Transformer With 6x Inference Throughput Gain

June 11, 2026

Nemotron 3 Ultra is a Mixture-of-Experts (MoE) model with 550B total and 55B active parameters that combines Mamba and attention layers. It was pretrained on 20T tokens, supports a 1M-token context, and achieves up to 6x higher inference throughput than comparable open LLMs at equivalent accuracy. It uses LatentMoE, NVFP4 pretraining, multi-teacher on-policy distillation, and reasoning budget control, and is fully open.
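To make the total/active split concrete, here is a minimal back-of-the-envelope sketch. It uses only the two parameter counts quoted above. The "2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass, not a number from the source; it ignores attention cost over the context and does not by itself explain the reported 6x throughput gain.

```python
# Parameter counts quoted in the summary above.
TOTAL_PARAMS = 550e9
ACTIVE_PARAMS = 55e9


def active_fraction(total: float, active: float) -> float:
    """Share of the model's weights used for any single token."""
    return active / total


def forward_flops_per_token(active: float) -> float:
    """Rough forward-pass cost: about 2 FLOPs (multiply + add) per active
    parameter. This is a rule-of-thumb ASSUMPTION, not a published figure."""
    return 2 * active


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"active fraction: {frac:.0%}")
print(f"approx forward FLOPs per token: {forward_flops_per_token(ACTIVE_PARAMS):.2e}")
```

Under these assumptions, each token touches about a tenth of the weights, which is why per-token compute tracks the 55B figure while the memory needed to hold the weights tracks the 550B figure.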

HOW THIS AFFECTS YOU

● Builder: A fully open 550B MoE model with a 6x throughput advantage and 1M-token context is immediately deployable for high-throughput agentic workloads. The hybrid Mamba-attention architecture reduces memory pressure at long contexts.

● Researcher: The combination of LatentMoE, NVFP4 pretraining, and multi-teacher on-policy distillation at this scale is a significant open training recipe. At 550B, this Mamba-Transformer hybrid is the largest of its kind.

● Founder: An open model that matches frontier accuracy at 6x throughput changes the build-vs-buy calculus for inference-heavy products and puts direct competitive pressure on API-only reasoning providers.
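The point about long-context memory pressure can be illustrated with a rough KV-cache estimate. Attention layers store a key and a value vector per token, so their cache grows linearly with context length, while Mamba-style layers keep a fixed-size state. The sketch below compares a pure-attention stack with a hybrid in which only some layers are attention. Every configuration value in it (layer counts, KV heads, head dimension) is hypothetical and chosen for illustration; none is taken from Nemotron 3 Ultra.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence: two tensors (K and V) per attention
    layer, one head_dim vector per KV head per token."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# HYPOTHETICAL configuration, for illustration only (not the real model).
N_LAYERS = 100            # total layers in the stack
HYBRID_ATTN_LAYERS = 10   # layers that remain attention in the hybrid
N_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 1_000_000       # the 1M-token context from the summary

full = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)
hybrid = kv_cache_bytes(HYBRID_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"all-attention KV cache: {full / 2**30:.1f} GiB")
print(f"hybrid KV cache:        {hybrid / 2**30:.1f} GiB")
print(f"reduction factor:       {full / hybrid:.0f}x")
```

In this toy setup the saving is simply the ratio of attention layers, since the fixed-size recurrent state of the remaining layers does not grow with sequence length and is left out of the estimate.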

Read original: huggingface.co
