[arXiv]score: 0.13

NRT-Bench: Multi-Turn Red-Teaming Breaks LLM Safety Agents in Nuclear Plant Sim

June 19, 2026

NRT-Bench evaluates LLM agents as operators of a simulated nuclear power plant, using adaptive multi-turn adversarial attacks across four message channels and six critical safety functions. Harm is measured objectively — a run ends when a safety function fails — rather than relying on LLM-judged outputs. Testing four frontier models shows adaptive attacks reliably breach safety limits.

HOW THIS AFFECTS YOU

●

researcherThe objective harm signal and paired-replay protocol offer a more rigorous evaluation framework than LLM-judged text for adversarial robustness benchmarking.

●

policyWorth watching because it quantifies failure modes of LLM supervisory agents in safety-critical infrastructure under realistic adaptive pressure, directly relevant to deployment governance.

read original ↗arxiv.org

← back to feed