
Gram Newton-Schulz Cuts Muon Optimizer Step Cost at Scale

June 9, 2026

Muon's Newton-Schulz orthogonalization runs in time cubic in the weight-matrix dimension, so optimizer steps become increasingly expensive as model size grows. This is a key bottleneck given Muon's adoption in models like Kimi K2 and GLM-5. Gram Newton-Schulz introduces a hardware-aware algorithm to reduce this overhead, benchmarked on B300 GPUs across Llama model sizes.
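To make the cubic cost concrete, here is a minimal sketch of the textbook cubic Newton-Schulz iteration, X <- 1.5 X - 0.5 (X X^T) X, in pure Python. This is an illustration only: it is not Muon's tuned iteration and not the Gram Newton-Schulz algorithm from the original post, and the names `matmul`, `transpose`, `newton_schulz`, and `matmul_flops_per_step` are hypothetical helpers introduced here.

```python
import math


def matmul(a, b):
    """Naive dense product of a (n x k) and b (k x m), as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=20):
    """Approximately orthogonalize g with X <- 1.5 X - 0.5 (X X^T) X.

    Assumption for this sketch: g is n x m with n <= m and full row rank.
    """
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which lies inside the iteration's convergence region.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        gram = matmul(x, transpose(x))  # n x n product, about n^2 * m multiply-adds
        gx = matmul(gram, x)            # n x m product, about n^2 * m multiply-adds
        x = [[1.5 * xv - 0.5 * gv for xv, gv in zip(x_row, g_row)]
             for x_row, g_row in zip(x, gx)]
    return x


def matmul_flops_per_step(n, m):
    """Rough flop count for the two matrix products in one step.

    Counts each multiply-add as 2 flops and ignores the elementwise update.
    """
    return 2 * n * n * m + 2 * n * n * m


if __name__ == "__main__":
    g = [[3.0, 0.0, 1.0], [1.0, 2.0, 0.0]]
    x = newton_schulz(g)
    print(matmul(x, transpose(x)))  # should be close to the 2 x 2 identity
    # Doubling both dimensions multiplies the matmul cost by 8 (cubic growth).
    print(matmul_flops_per_step(8, 16) // matmul_flops_per_step(4, 8))
```

Both matrix products in each step scale with n^2 * m, so doubling both dimensions multiplies the per-step cost by eight. That growth is the overhead the summary describes.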

HOW THIS AFFECTS YOU

● Builder: Teams training frontier-scale models with Muon may be able to reduce wall-clock training time without switching optimizers or sacrificing convergence quality.

● Researcher: If you're training large models with Muon, this directly addresses the cubic-time orthogonalization bottleneck that makes Muon's per-step cost grow faster than AdamW's at scale.

Read original: tridao.me
