Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes

May 16, 2026

A theoretical analysis proves that MSE and cross-entropy losses, combined with positively biased activations (e.g., ReLU), cause systematic negative weight drift during early training: gradients on positive pre-activations are non-negative in expectation, so each gradient descent step tends to push the affected weights downward. The effect does not depend on a particular architecture; it arises from the optimization dynamics themselves. Practitioners debugging training instability or designing activation functions should take note, since the result formalizes a previously underappreciated dynamic.
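The sign argument can be checked numerically on a toy model. The sketch below is not the paper's derivation or code; it is an illustration under assumptions chosen here: a one-hidden-layer ReLU network, zero-mean Gaussian initialization, non-negative inputs standing in for the outputs of an earlier ReLU layer, an independent zero-mean target, and the loss L = 0.5 * (y - t)^2. The function name `mean_hidden_weight_gradient` and all parameters are hypothetical. Under these assumptions the output y contains the term v_i * h_i, so the error correlates positively with v_i and the expected gradient on each hidden weight is non-negative.

```python
import math
import random


def mean_hidden_weight_gradient(n_trials=2000, n_in=8, n_hidden=16, seed=0):
    """Average dL/dW over random initializations of a toy ReLU network.

    Illustrative assumptions (not taken from the paper): one hidden ReLU
    layer, Gaussian zero-mean init, MSE loss, non-negative inputs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Non-negative inputs mimic activations coming out of an earlier ReLU.
        x = [abs(rng.gauss(0.0, 1.0)) for _ in range(n_in)]
        # Zero-mean random initialization of hidden and output weights.
        W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
             for _ in range(n_hidden)]
        v = [rng.gauss(0.0, 1.0 / math.sqrt(n_hidden)) for _ in range(n_hidden)]
        # Zero-mean target drawn independently of the input and the weights.
        t = rng.gauss(0.0, 1.0)

        # Forward pass: pre-activations, ReLU, scalar output.
        pre = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
        h = [max(p, 0.0) for p in pre]
        y = sum(v_i * h_i for v_i, h_i in zip(v, h))
        err = y - t  # dL/dy for L = 0.5 * (y - t) ** 2

        # Backward pass: dL/dpre_i = err * v_i on active units, else 0.
        delta = [err * v_i if p > 0 else 0.0 for v_i, p in zip(v, pre)]
        # dL/dW_ij = delta_i * x_j; average over all hidden weights.
        grad_mean = sum(d * x_j for d in delta for x_j in x) / (n_hidden * n_in)
        total += grad_mean
    return total / n_trials


if __name__ == "__main__":
    g = mean_hidden_weight_gradient()
    print(f"mean dL/dW over random inits: {g:.4f}")
```

A positive average means that a plain gradient descent step, W <- W - lr * dL/dW, lowers the hidden weights on average at initialization, which is the direction of drift the summary describes. The sketch covers only the MSE case and a single step; it says nothing about how long the effect persists during training.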

SOURCE

https://huggingface.co/papers/2605.17659
