[arXiv] score: 0.24

A Theoretical Game of Attacks via Compositional Skills

May 5, 2026
A new arXiv preprint (2605.01034) formalizes LLM red-teaming as a two-player game between an attacker and a defender. The authors derive closed-form best-response strategies for both players and characterize the Nash equilibria, which reveal a structural advantage for the attacker. The framework unifies existing adversarial prompting methods under one theoretical lens and yields a provably optimal defense strategy. In the reported experiments, the theory-grounded attack outperforms prior adversarial prompting baselines across diverse evaluation settings. AI safety engineers and alignment researchers should prioritize this work because it reframes jailbreak robustness from empirical whack-a-mole to principled game-theoretic optimization.
cs.CL
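
For readers unfamiliar with the vocabulary, the sketch below illustrates what "best response" and "Nash equilibrium" mean in the simplest possible setting: a 2x2 zero-sum matrix game solved with the standard textbook closed form. It is not the paper's model or derivation. The payoff matrix, the two attack skills, the two defenses, and all function names are hypothetical and chosen only for illustration; each entry is an assumed attack success probability, which the attacker maximizes and the defender minimizes.

```python
from fractions import Fraction

# HYPOTHETICAL payoff matrix, not taken from the paper.
# Rows: attacker choices (skill A, skill B). Columns: defender choices (defense X, defense Y).
# Entry [i][j] is an assumed probability that attack i succeeds against defense j.
PAYOFF = [
    [Fraction(4, 5), Fraction(1, 5)],
    [Fraction(3, 10), Fraction(3, 5)],
]


def best_response_attacker(payoff, defender_mix):
    """Return the row index that maximizes expected payoff against a defender mix."""
    expected = [sum(p * q for p, q in zip(row, defender_mix)) for row in payoff]
    return max(range(len(payoff)), key=lambda i: expected[i])


def best_response_defender(payoff, attacker_mix):
    """Return the column index that minimizes expected payoff against an attacker mix."""
    n_cols = len(payoff[0])
    expected = [
        sum(attacker_mix[i] * payoff[i][j] for i in range(len(payoff)))
        for j in range(n_cols)
    ]
    return min(range(n_cols), key=lambda j: expected[j])


def solve_2x2_zero_sum(payoff):
    """Return (attacker_mix, defender_mix, value) for a 2x2 zero-sum game."""
    (a, b), (c, d) = payoff
    row_mins = [min(a, b), min(c, d)]
    col_maxs = [max(a, c), max(b, d)]
    maximin, minimax = max(row_mins), min(col_maxs)
    if maximin == minimax:
        # Saddle point: both players have a pure equilibrium strategy.
        i, j = row_mins.index(maximin), col_maxs.index(minimax)
        attacker = tuple(Fraction(int(k == i)) for k in range(2))
        defender = tuple(Fraction(int(k == j)) for k in range(2))
        return attacker, defender, maximin
    # No saddle point: each player mixes so the opponent is indifferent.
    denom = a - b - c + d
    p = (d - c) / denom  # probability the attacker plays row 0
    q = (d - b) / denom  # probability the defender plays column 0
    value = (a * d - b * c) / denom
    return (p, 1 - p), (q, 1 - q), value


attacker_mix, defender_mix, value = solve_2x2_zero_sum(PAYOFF)
print(attacker_mix, defender_mix, value)
```

With these assumed numbers the attacker mixes (1/3, 2/3), the defender mixes (4/9, 5/9), and the game value is 7/15: at equilibrium neither side can improve its expected outcome by deviating unilaterally. A realistic red-teaming game would have far larger strategy spaces, so this closed form only conveys the idea.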