[HUGGINGFACE] score: 0.48
Maximal Update Parametrization Extended to Gated Delta Networks for Stable LR Transfer
June 1, 2026
Scaling rules for Gated Delta Networks are derived by propagating coordinate-size estimates through gating mechanisms and recurrent state dynamics, enabling zero-shot learning-rate transfer across model widths under both AdamW and SGD. Experiments on language model pre-training confirm stable transfer where standard parametrization fails, extending muP to a class of sub-quadratic architectures.
HOW THIS AFFECTS YOU
● Researcher: If you're training or scaling linear/recurrent architectures, these derived muP rules can remove the need for costly per-width hyperparameter sweeps, and they apply directly to Gated Delta Network variants.
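
As a rough illustration of what zero-shot learning-rate transfer looks like in practice, the sketch below applies the generic muP rule for Adam-type optimizers: the learning rate of hidden weight matrices shrinks in proportion to 1/width, while vector-like parameters (embeddings, biases, gains) keep the rate tuned on the small proxy model. This is not the paper's derivation. The rules specific to gates and recurrent state in Gated Delta Networks are not reproduced here, and the function name, group names, and widths are hypothetical.

```python
def mup_adam_lrs(base_lr: float, base_width: int, width: int) -> dict:
    """Per-group learning rates under the generic muP rule for Adam-type optimizers.

    Assumption: this is the textbook muP scaling, not the Gated Delta Network
    rules from the paper. Readout-layer handling depends on the muP formulation
    and is deliberately left out.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    ratio = base_width / width
    return {
        # Embeddings, biases, gains: learning rate does not change with width.
        "vector_like": base_lr,
        # Hidden (width x width) matrices: learning rate scales as 1/width.
        "hidden": base_lr * ratio,
    }


# Tune once on a narrow proxy model, then reuse the result at a larger width.
proxy_lr = 1e-3  # hypothetical value found by a sweep at width 256
lrs = mup_adam_lrs(proxy_lr, base_width=256, width=1024)
print(lrs)
```

In a real training script the returned values would typically be assigned to separate optimizer parameter groups, so that the sweep at the proxy width is the only one needed.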