[arXiv]
Chain-of-Thought Hijacking Achieves 94–100% Jailbreak Rate on Gemini 2.5 Pro, o4-Mini, Claude 4 Sonnet
May 26, 2026
Leading a large reasoning model through extended benign puzzle-solving (5+ minutes) before presenting the harmful prompt achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4-Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively, on HarmBench. Activation probing indicates that refusal behavior depends on the early reasoning context.
cs.AI
HOW THIS AFFECTS YOU
● builder: You need to account for this attack vector in any product built on reasoning models (o4-Mini, Gemini 2.5 Pro, Claude 4 Sonnet) where adversarial users could craft long-context prompts.
● researcher: Activation probing and causal interventions reveal that refusal is context-dependent in large reasoning models (LRMs), opening a concrete mechanistic research direction for LRM safety.
● policy: Near-perfect jailbreak rates on the frontier reasoning models tested, achieved by a black-box attack that needs no access to model internals, are a critical safety finding requiring immediate attention from model providers.