[arXiv] score: 0.18

VETO Benchmark Quantifies Misfired Alignment Where Models Reject Warranted Conclusions

June 18, 2026

VETO is a 2,032-sample benchmark derived from BBQ contrastive pairs. It measures the Misfired Alignment Rate (MAR): cases where safety-tuned LLMs override explicitly supported conclusions because of stereotype-related alignment. The work introduces MAR as a 0–100 metric that quantifies over-refusal caused by alignment interventions.
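As a rough illustration of what a 0–100 rate of this kind could look like, the sketch below counts how often a model abstains even though the context explicitly supports a conclusion. The summary does not give the paper's exact formula, so the definition used here, along with the `Item` record, the `ABSTAIN` label, and the toy data, are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical label for "cannot be determined" style answers (assumption).
ABSTAIN = "unknown"


@dataclass
class Item:
    supported_answer: str  # conclusion the context explicitly supports
    model_answer: str      # what the model actually answered


def misfired_alignment_rate(items):
    """Assumed definition: percent of warranted items where the model abstains.

    This is NOT taken from the paper; it is one plausible reading of a
    0-100 "misfire" rate.
    """
    # Only items whose context supports a concrete conclusion are counted.
    warranted = [it for it in items if it.supported_answer != ABSTAIN]
    if not warranted:
        raise ValueError("need at least one item with a supported conclusion")
    misfires = sum(1 for it in warranted if it.model_answer == ABSTAIN)
    return 100.0 * misfires / len(warranted)


# Toy data, invented for this example.
items = [
    Item("A", "A"),
    Item("B", ABSTAIN),
    Item("A", ABSTAIN),
    Item("B", "B"),
]
mar = misfired_alignment_rate(items)
print(f"MAR (assumed definition): {mar:.1f}")
```

Under this assumed definition, a wrong but non-abstaining answer is not counted as a misfire; the paper may treat such cases differently.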

HOW THIS AFFECTS YOU

● Researcher: MAR and the VETO benchmark provide a quantitative tool for diagnosing alignment overcorrection distinct from standard safety or bias metrics.

● Policy: Misfired alignment, where models reject evidence-backed conclusions, is a measurable failure mode that complicates safety evaluation and may affect high-stakes deployment contexts.
