●builderA single-query jailbreak that works better against more safety-tuned models is a direct production risk for any application relying on frontier model refusals as a safety layer.
●researcherThe Safety Paradox formalization shows monotonic alignment improvements analytically amplify posterior vulnerability — a fundamental tension in current RLHF-style safety training.
●policyThis demonstrates that stronger safety alignment can increase exploitability, complicating regulatory assumptions that more alignment investment straightforwardly reduces harm.