[HUGGINGFACE] score: 0.80
Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes
May 16, 2026
A theoretical analysis proves that MSE and cross-entropy losses, combined with positively biased activations such as ReLU, cause systematic negative weight drift during early training. The gradients on positive pre-activations are non-negative in expectation, so each gradient-descent step, which subtracts the gradient, lowers the affected weights on average. According to the analysis, the effect is architecture-agnostic and arises from the optimization dynamics themselves. For practitioners debugging training instability or designing activation functions, the result formalizes a previously underappreciated dynamic.
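For readers who want to check for this behavior in their own training runs, the following is a minimal monitoring sketch, not the paper's method. The helper names `mean_weight_drift` and `activation_sparsity` are hypothetical, and the arrays are hand-made placeholders standing in for a layer's weights at initialization, its weights after some steps, and a batch of pre-activations.

```python
import numpy as np


def mean_weight_drift(w_init, w_now):
    """Signed change in a layer's mean weight since initialization.

    A negative value means the weights have moved downward on average.
    """
    return float(np.mean(w_now) - np.mean(w_init))


def activation_sparsity(pre_activations):
    """Fraction of units that a ReLU would zero out (pre-activation <= 0)."""
    return float(np.mean(np.asarray(pre_activations) <= 0))


# Placeholder data (assumed for illustration, not taken from the paper).
w0 = np.array([[0.2, -0.1], [0.0, 0.3]])    # weights at initialization
w1 = np.array([[0.1, -0.2], [-0.1, 0.2]])   # weights after some training steps
pre = np.array([-1.0, 0.5, -0.2, 2.0])      # pre-activations for one batch

drift = mean_weight_drift(w0, w1)      # negative here: mean weight went down
sparsity = activation_sparsity(pre)    # half of the units would be inactive
print(f"mean weight drift: {drift:+.3f}, ReLU sparsity: {sparsity:.2f}")
```

Logging these two quantities per layer at regular intervals is one simple way to see whether the mean weight trends downward early in training and whether activation sparsity rises alongside it.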
paper