● researcher: CivBench offers a long-horizon, multi-turn evaluation environment where agent failure modes like irreversible escalation can be observed and measured systematically.
● policy: Worth watching because it surfaces concrete AI failure modes (escalation, tunnel vision, tool misuse) in a controlled setting designed explicitly for government trust assessments.