[arXiv] score: 0.19

CIAware-Bench Finds Frontier Models Have Low-to-Moderate AI Control Intervention Awareness

June 10, 2026

CIAware-Bench measures whether models can detect when a control protocol has modified their output trajectory. Evaluated on 11 frontier models across four task domains, including BigCodeBench and SHADE-Arena, it finds awareness scores of up to 0.87 against a random baseline of 0.5. The low default awareness suggests that current models are unlikely to actively evade monitoring systems.

HOW THIS AFFECTS YOU

● researcher: Quantifies a concrete safety property, control intervention awareness, across frontier models, with task-domain and watermarking ablations useful for alignment research.

● policy: Low CI awareness scores provide empirical support for current AI control protocol designs, but the benchmark also maps the attack surface if awareness increases.

