[arXiv]

A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay

May 7, 2026

MetaAdamW (arXiv:2605.04055) is a meta-learned optimizer that replaces AdamW's uniform hyperparameters with a lightweight Transformer encoder. The encoder computes a learning rate and a weight decay value for each parameter group from that group's gradient norm, its momentum norm, and inter-group correlations. The encoder is trained with a composite meta-objective that combines gradient alignment, loss decrease, and generalization gap; the three terms are balanced by homoscedastic uncertainty weighting with task-specific priority scaling. The method is evaluated across five tasks and targets layer-wise optimization heterogeneity, which AdamW and Lion do not address when run with a single learning rate and weight decay for all parameters. ML engineers training large heterogeneous architectures, particularly multi-task or multi-modal models, may want to evaluate it as a drop-in AdamW replacement with adaptive per-group regularization.
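To make the per-group idea concrete, here is a minimal pure-Python sketch of a decoupled-weight-decay Adam step in which each parameter group receives its own learning rate and weight decay. It is not the paper's implementation: `toy_controller` is a hypothetical hand-written rule standing in for the learned Transformer encoder, it uses only gradient and momentum norms (no inter-group correlations), and the group names, base values, and scaling formula are assumptions chosen for illustration.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in values))


def toy_controller(grad_norm, momentum_norm, base_lr=1e-3, base_wd=1e-2):
    """Hypothetical stand-in for the learned encoder (an assumption, not the paper's model).

    Groups with larger gradient or momentum norms get a smaller learning rate
    and a larger weight decay.
    """
    scale = 1.0 / (1.0 + grad_norm + momentum_norm)
    return base_lr * scale, base_wd * (2.0 - scale)


def adamw_group_step(params, grads, state, lr, weight_decay,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with decoupled weight decay for a single group, in place."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grads):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias-corrected moment estimates.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Decoupled weight decay first, then the adaptive gradient step.
        params[i] -= lr * weight_decay * params[i]
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


# Two illustrative groups with very different gradient scales.
groups = {
    "embedding": {"params": [0.5, -0.3], "grads": [0.01, -0.02]},
    "head": {"params": [1.0, 2.0], "grads": [3.0, -4.0]},
}
for group in groups.values():
    n = len(group["params"])
    group["state"] = {"t": 0, "m": [0.0] * n, "v": [0.0] * n}

chosen = {}
for name, group in groups.items():
    lr, wd = toy_controller(l2_norm(group["grads"]), l2_norm(group["state"]["m"]))
    chosen[name] = (lr, wd)
    adamw_group_step(group["params"], group["grads"], group["state"], lr, wd)
```

In a real framework the same effect is usually obtained by putting each group in its own optimizer parameter group and overwriting that group's learning rate and weight decay before every step; the sketch only shows where group-specific values enter the update.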

cs.LG

SOURCE

https://arxiv.org/abs/2605.04055
