[arXiv] score: 0.28

Posterior Attack Jailbreaks 30 LLMs Including GPT-5 and Claude 4.6

June 5, 2026

Safety-aligned LLMs develop internal classifiers that recognize unsafe content. Posterior Attack exploits this by prompting a model, in a single query, to generate exactly what its own classifier would flag. The attack was tested on 30 open-source models of up to 35B parameters and on frontier models including GPT-5 and Claude 4.6; stronger safety-judgment capability correlated with higher susceptibility.

HOW THIS AFFECTS YOU

● builder: A single-query jailbreak that works better against more safety-tuned models is a direct production risk for any application relying on frontier model refusals as a safety layer.

● researcher: The Safety Paradox formalization shows analytically that monotonic alignment improvements amplify posterior vulnerability, a fundamental tension in current RLHF-style safety training.

● policy: This demonstrates that stronger safety alignment can increase exploitability, complicating regulatory assumptions that more alignment investment straightforwardly reduces harm.

Read original: arxiv.org