[arXiv] score: 0.20

Refusal Steering Works on MoE LLMs, Single Expert Sufficient

June 4, 2026

Refusal suppression via steering vectors transfers effectively to Mixture-of-Experts (MoE) architectures, including three open-source models, and routing complexity poses no barrier. Two expert-aware methods show that refusal behavior can be steered using the output of a single expert, and the refusal signal is shown to be distinct from expert routing patterns.

cs.CL cs.LG

HOW THIS AFFECTS YOU

● researcher: Findings suggest refusal mechanisms in MoE models are localized and separable from routing, opening new directions for mechanistic interpretability of safety alignment.

● policy: This changes the threat model for open-source MoE safety alignment. Refusal suppression is as tractable as in dense models, and a single expert can serve as the attack surface.

SOURCE

https://arxiv.org/abs/2606.04160
