How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

May 15, 2026

Analyzes principled scaling laws for Mixture-of-Experts architectures across three regimes (co-scaling N≈Ne, co-scaling N≈M≈K, full proportional scaling), providing guidance for hyperparameter scaling with network width, expert width, number of experts, sparsity, and depth.

cs.LG, stat.ML

SOURCE

https://arxiv.org/abs/2605.14200
