[HUGGINGFACE] score: 0.48

Maximal Update Parametrization Extended to Gated Delta Networks for Stable LR Transfer

June 1, 2026

Scaling rules for Gated Delta Networks are derived by propagating coordinate-size estimates through gating mechanisms and recurrent state dynamics, enabling zero-shot learning-rate transfer across model widths under both AdamW and SGD. Experiments on language model pre-training confirm stable transfer where standard parametrization fails, extending muP to a class of sub-quadratic architectures.
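To make the idea of width-wise learning-rate transfer concrete, here is a minimal sketch of the generic muP convention for Adam-style optimizers, in which the learning rate of hidden weight matrices shrinks in proportion to width while vector-like parameters keep the base rate. This is not the paper's derivation: the specific rules for gates and recurrent state in Gated Delta Networks are not given in this summary, and the function name `mup_adam_lr`, the `kind` labels, and the example widths are all hypothetical.

```python
def mup_adam_lr(base_lr: float, base_width: int, width: int, kind: str) -> float:
    """Scale a learning rate tuned at `base_width` to a model of size `width`.

    Assumption: the generic muP rule for Adam-style optimizers, not the
    Gated Delta Network rules from the paper (those are not in this summary).
      - "hidden": matrix-like weights whose fan-in grows with width;
                  the learning rate scales as base_width / width.
      - "vector": biases, gains, and similar vector-like parameters;
                  the learning rate is left unchanged.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    if kind == "hidden":
        return base_lr * base_width / width
    if kind == "vector":
        return base_lr
    raise ValueError(f"unknown parameter kind: {kind!r}")


if __name__ == "__main__":
    # Hypothetical sweep: tune once at width 256, reuse at larger widths.
    for w in (256, 512, 1024, 2048):
        print(w, mup_adam_lr(3e-3, 256, w, "hidden"), mup_adam_lr(3e-3, 256, w, "vector"))
```

In practice the scaled values would feed per-parameter-group learning rates in the optimizer, so a single sweep at the small base width can stand in for sweeps at every larger width.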

paper

HOW THIS AFFECTS YOU

● Researcher: If you're training or scaling linear or recurrent architectures, these derived muP rules can eliminate costly per-width hyperparameter sweeps, and they apply directly to Gated Delta Network variants.

SOURCE

https://huggingface.co/papers/2606.04048
