[arXiv] score: 0.19
Multi-Agent Debate Hurts Data Generation but Boosts Error Detection by 27pp F1
June 3, 2026
Across 6,000+ task-condition pairs and four model families, multi-agent debate degrades generative data cleaning by 1.6–15.5pp through hallucinated critic feedback, yet improves error-detection F1 by 27.4pp. A separate critic with code-execution grounding and evidence-gated generation is the first configuration to beat the single-agent baseline on a generative task (+5.3pp, p<0.05).
cs.AI cs.CL cs.MA
HOW THIS AFFECTS YOU
● builder: If you're using multi-agent debate for data-cleaning pipelines, split the roles: use debate for detection, not generation, and ground your critic with code execution.
● researcher: The derived debate benefit condition and factorial experiment design provide a reusable framework for predicting when multi-agent debate helps versus hurts on structured tasks.
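
For builders, one way to read "ground your critic with code execution" and "evidence-gated generation" is sketched below. This is a minimal illustration, not the paper's implementation: the `Claim` type, the `gate_claims` helper, and the sample rows are all hypothetical names introduced here. The assumption is that each critic claim carries an executable check, and that only claims whose check actually fires on the data are passed on to the generator.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass
class Claim:
    """Hypothetical critic output: an alleged error plus an executable check."""
    row_index: int
    column: str
    reason: str
    check: Callable[[Row], bool]  # returns True when the row really violates the rule


def gate_claims(rows: list[Row], claims: list[Claim]) -> list[Claim]:
    """Keep only claims whose check fires on the row the claim points at."""
    verified = []
    for claim in claims:
        if not 0 <= claim.row_index < len(rows):
            continue  # the claim references a row that does not exist
        try:
            if claim.check(rows[claim.row_index]):
                verified.append(claim)
        except Exception:
            # A check that crashes provides no evidence, so the claim is dropped.
            continue
    return verified


# Illustrative data, not taken from the paper.
rows = [
    {"name": "Ada", "age": 36},
    {"name": "Bob", "age": -4},
]

claims = [
    Claim(1, "age", "age is negative", lambda r: r["age"] < 0),
    Claim(0, "age", "age is negative", lambda r: r["age"] < 0),  # unsupported claim
    Claim(0, "email", "email is malformed", lambda r: "@" not in r["email"]),  # no such column
]

verified = gate_claims(rows, claims)
flagged = [(c.row_index, c.column) for c in verified]
print(flagged)
```

In this sketch only the claim about row 1 survives the gate. The claim whose check returns false and the claim that refers to a nonexistent column are both discarded before any generation step would see them, which is the intended effect of gating on evidence.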