[arXiv] score: 0.20
Refusal Steering Works on MoE LLMs, Single Expert Sufficient
June 4, 2026
Refusal suppression via steering vectors transfers effectively to mixture-of-experts (MoE) architectures, as shown on three open-source models; routing complexity poses no barrier. Two expert-aware methods show that refusal behavior can be steered using a single expert's output, and that refusal signals are distinct from expert routing patterns.
cs.CL, cs.LG
HOW THIS AFFECTS YOU
● researcher: Findings suggest refusal mechanisms in MoE models are localized and separable from routing, opening new directions for mechanistic interpretability of safety alignment.
● policy: This changes the threat model for open-source MoE safety alignment. Refusal suppression is as tractable as in dense models, and a single expert can serve as the attack surface.