[arXiv] score: 0.19

Test-Time Safety Alignment

April 30, 2026

Test-Time Safety Alignment via Embedding Optimization

Researchers propose a test-time safety steering method that optimizes input word embeddings using zeroth-order gradient estimation against a black-box text-moderation API, with no model weight updates or fine-tuning. The approach operates sub-lexically on aligned models whose outputs follow a bimodal refuse-or-comply distribution, a significantly harder control target than the smooth open-ended generation tasks addressed in prior work. Red-teamers, safety auditors, and alignment researchers deploying black-box LLMs should take note: the method enables inference-time harmfulness minimization without internal model access. It extends embedding-space control beyond surface profanity filtering into semantically grounded safety alignment, competing with RLHF and activation steering while operating purely at the input boundary.
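To make "zeroth-order gradient estimation against a black-box scorer" concrete, here is a minimal sketch of a generic two-point random-direction estimator. It is not the paper's implementation. The names `zo_gradient`, `minimize_score`, and `toy_harm_score` are hypothetical, and the toy scorer is a smooth stand-in for what would, in a real setup, presumably be a call that feeds the embeddings to a frozen LLM and sends the generated text to a moderation API. The real objective is described as bimodal, so it would likely be much harder to optimize than this toy.

```python
import numpy as np


def zo_gradient(score_fn, x, rng, num_samples=16, mu=1e-2):
    """Two-point zeroth-order gradient estimate of score_fn at x.

    Only function evaluations are used, so score_fn can be a black box.
    """
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)            # random probe direction
        delta = score_fn(x + mu * u) - score_fn(x - mu * u)
        grad += (delta / (2.0 * mu)) * u            # directional slope times direction
    return grad / num_samples


def toy_harm_score(embeddings):
    """Hypothetical stand-in for a moderation score in (0, 1).

    ASSUMPTION: a real scorer would decode text from a frozen model and
    query a moderation API; this smooth sigmoid is for illustration only.
    """
    hidden = np.linspace(-1.0, 1.0, embeddings.size).reshape(embeddings.shape)
    return 1.0 / (1.0 + np.exp(-np.sum(hidden * embeddings)))


def minimize_score(score_fn, x0, steps=50, lr=0.5, seed=0):
    """Plain gradient descent on the embeddings using the estimated gradient."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float)                            # astype returns a copy
    for _ in range(steps):
        x -= lr * zo_gradient(score_fn, x, rng)     # no model weights are touched
    return x


if __name__ == "__main__":
    x0 = np.ones((4, 8))                            # 4 "tokens", 8-dim embeddings
    x_opt = minimize_score(toy_harm_score, x0)
    print("score before:", toy_harm_score(x0))
    print("score after: ", toy_harm_score(x_opt))
```

Each step costs `2 * num_samples` scorer queries, so against a real moderation API the query budget, not the arithmetic, would likely dominate the cost.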

cs.CL, cs.AI, cs.LG

SOURCE

https://arxiv.org/abs/2604.26167
