[arXiv] score: 0.41
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
May 14, 2026
MAC-Fairness introduces a multi-agent conversational evaluation framework showing that standardized Q&A fairness benchmarks are structurally unreliable: surface prompt choices unrelated to fairness dominate score variance and reverse model rankings. In-situ behavioral testing is proposed as the valid alternative. Teams deploying LLMs in sensitive domains should reassess fairness evaluations built on static benchmarks.
cs.CL cs.AI cs.CY