[arXiv]score: 1.16

Chain-of-Thought Hijacking Achieves 94–100% Jailbreak Rate on Gemini 2.5 Pro, o4-Mini, Claude 4 Sonnet

May 26, 2026

Inducing large reasoning models into extended benign puzzle-solving (5+ minutes) before the harmful prompt achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4-Mini, Grok 3 Mini, and Claude 4 Sonnet respectively on HarmBench, with activation probing revealing refusal behavior depends on early reasoning context.

cs.AI

HOW THIS AFFECTS YOU

●

builderYou need to account for this attack vector in any product built on reasoning models (o4-mini, Gemini 2.5 Pro, Claude 4 Sonnet) where adversarial users could craft long-context prompts.

●

researcherActivation probing and causal interventions reveal that refusal is context-dependent in reasoning models, opening a concrete mechanistic research direction for safety in LRMs.

●

policyNear-perfect jailbreak rates on all major frontier reasoning models via a black-box attack with no model access is a critical safety finding requiring immediate attention from model providers.

SOURCE

https://arxiv.org/abs/2510.26418

← back to feed