[arXiv] score: 0.50
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
May 15, 2026
Analyzes principled scaling laws for Mixture-of-Experts architectures across three regimes (co-scaling N≈Ne, co-scaling N≈M≈K, full proportional scaling), providing guidance for hyperparameter scaling with network width, expert width, number of experts, sparsity, and depth.
cs.LG, stat.ML