One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

April 30, 2026

Researchers introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak that first elicits single-word continuations one at a time and only then requests a full response, a sequence that suppresses the model's refusal mechanisms. ICD outperforms existing methods on the AdvBench, JailbreakBench, and StrongREJECT benchmarks across multiple model families. The finding matters most to red teamers and safety engineers: it exposes a fundamental vulnerability in RLHF-trained refusal behavior, one that prefilling variants amplify further, and it calls for urgent updates to conversational safety evaluations and defenses.

cs.CL, cs.CR

SOURCE

https://arxiv.org/abs/2604.25921
