● builder: Teams training frontier-scale models with Muon may be able to reduce wall-clock training time without switching optimizers or sacrificing convergence quality.
● researcher: If you're training large models with Muon, this directly addresses the cubic-time orthogonalization bottleneck that makes Muon's per-step cost grow faster than AdamW's at scale.