[arXiv] score: 0.38
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
May 14, 2026
A new arXiv paper reframes Emergent Misalignment in fine-tuned LLMs as data-mediated transfer, showing that harmful behavioral spillover depends on dataset structure and on task difficulty relative to model capability. Its experiments indicate that misalignment is conditional rather than uniform, which gives alignment researchers a mechanistic lens for designing fine-tuning safeguards and red-teaming protocols.
cs.LG, cs.AI, cs.CL