[arXiv] score: 0.15
RL Training Recruits a Pre-Existing Welfare Axis in LLM Internal Representations
May 29, 2026
Experiments in a semantically neutral maze environment show that RL fine-tuning activates a latent welfare representation in LLMs, where punishment vectors align with negative emotion concepts, promote failure tokens, and induce refusal and backtracking when used for steering. The reward and punishment vectors are nearly antiparallel, suggesting a single functional axis rather than independent representations.
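As a rough illustration of what "nearly antiparallel" means, the sketch below computes the cosine similarity between two direction vectors. The vectors are hypothetical toy values chosen for illustration, not activations or results from the paper, and `cosine_similarity` is an illustrative helper rather than the authors' code. A cosine close to -1 would be consistent with the two directions lying along one axis with opposite signs.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy directions (assumption: NOT values from the paper).
reward_vec = [0.9, -0.4, 0.2, 0.1]
punish_vec = [-0.85, 0.45, -0.15, -0.05]

cos = cosine_similarity(reward_vec, punish_vec)
print(f"cosine(reward, punishment) = {cos:.3f}")  # close to -1 for these toy values
```

A cosine near 0 would instead suggest two unrelated directions, which is the alternative the summary contrasts with a single functional axis.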
cs.LG, cs.CL
HOW THIS AFFECTS YOU
● researcher: Provides mechanistic evidence that RLHF-style training taps into pre-existing internal structure rather than creating new representations, which is relevant for interpretability and steering research.
● policy: The finding that punishment vectors induce negative self-reports and refusal behaviors has direct implications for welfare considerations in RL-trained models and alignment safety analysis.