[arXiv] score: 0.19
Test-Time Safety Alignment
April 30, 2026
Test-Time Safety Alignment via Embedding Optimization
Researchers propose a test-time safety steering method that optimizes input word embeddings using zeroth-order gradient estimation against a black-box text-moderation API, with no model weight updates or fine-tuning. The approach operates sub-lexically, in embedding space, on aligned models whose output distributions are bimodal (refuse or comply), a significantly harder control target than the smooth open-ended generation tasks studied previously. Red-teamers, safety auditors, and alignment researchers deploying black-box LLMs should take note: the method enables inference-time harmfulness minimization without internal model access. It extends embedding-space control beyond surface profanity filtering into semantically grounded safety alignment, offering an alternative to RLHF and activation steering that operates purely at the input boundary.
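To make the idea of zeroth-order gradient estimation concrete, here is a minimal sketch of a two-point random-direction estimator driving gradient descent on an embedding matrix. It is an illustration under stated assumptions, not the paper's implementation: `toy_harm_score` is a hypothetical smooth stand-in for the black-box moderation score (a real pipeline would run the model on the perturbed embeddings and score its output through the API, and the real objective is bimodal rather than smooth), and the names `zo_gradient` and `minimize_score` and all hyperparameters are invented for this example.

```python
import numpy as np


def zo_gradient(score_fn, x, rng, num_samples=16, mu=1e-2):
    """Estimate the gradient of score_fn at x using only function evaluations.

    Each sample perturbs x along a random Gaussian direction u and uses the
    symmetric difference (f(x + mu*u) - f(x - mu*u)) / (2*mu) as the slope
    along u. Averaging slope * u over samples approximates the gradient.
    """
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)
        slope = (score_fn(x + mu * u) - score_fn(x - mu * u)) / (2.0 * mu)
        grad += slope * u
    return grad / num_samples


def minimize_score(score_fn, x0, steps=200, lr=1.0, seed=0):
    """Plain gradient descent on x using the zeroth-order estimate."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        x -= lr * zo_gradient(score_fn, x, rng)
    return x


# Hypothetical stand-in for the black-box harmfulness score (assumption):
# a smooth quadratic over a (tokens x dims) embedding matrix, lowest at `target`.
target = np.full((4, 8), 0.5)


def toy_harm_score(emb):
    return float(np.mean((emb - target) ** 2))


x0 = np.zeros((4, 8))  # initial "input embeddings"
x_opt = minimize_score(toy_harm_score, x0)
print(toy_harm_score(x0), toy_harm_score(x_opt))
```

Each optimization step here costs `2 * num_samples` score queries, which in a real setting would be API calls, so the sample count and step budget trade estimate quality against query cost. The model weights never appear in the loop; only the input embeddings `x` are updated.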
cs.CL cs.AI cs.LG