[arXiv] score: 0.19

Multi-Agent Debate Hurts Data Generation but Boosts Error Detection by 27pp F1

June 3, 2026

Across 6,000+ task-condition pairs and four model families, multi-agent debate degrades generative data cleaning by 1.6–15.5pp because of hallucinated critic feedback, but it improves error-detection F1 by 27.4pp. A separate critic with code-execution grounding and evidence-gated generation is the first configuration to beat the single-agent baseline on a generative task (+5.3pp, p<0.05).

cs.AI cs.CL cs.MA

HOW THIS AFFECTS YOU

● builder: If you're using multi-agent debate for data cleaning pipelines, split the roles: use debate for detection, not generation, and ground your critic with code execution.

● researcher: The paper's derived debate-benefit condition and factorial experiment design provide a reusable framework for predicting when multi-agent debate helps or hurts on structured tasks.
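
The builder advice above (ground the critic with code execution and gate generation on evidence) can be illustrated with a minimal sketch. This is not the paper's implementation: `CriticClaim`, `is_not_numeric`, and `gate_claims` are hypothetical names, and the sketch assumes the critic emits an executable check alongside each claim. The only idea shown is that a critic claim reaches the generator only when its check, run against the actual data, confirms the error.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, str]


@dataclass
class CriticClaim:
    """Hypothetical record of one critic assertion about a cell."""
    row_index: int
    column: str
    reason: str
    # Executable evidence: returns True when the cell value really is erroneous.
    check: Callable[[str], bool]


def is_not_numeric(value: str) -> bool:
    """Example check: the value cannot be parsed as a number."""
    try:
        float(value)
        return False
    except ValueError:
        return True


def gate_claims(rows: list[Row], claims: list[CriticClaim]) -> list[CriticClaim]:
    """Keep only claims whose check confirms the error on the real data."""
    verified = []
    for claim in claims:
        # Discard claims that point at rows or columns that do not exist.
        if not 0 <= claim.row_index < len(rows):
            continue
        row = rows[claim.row_index]
        if claim.column not in row:
            continue
        # Run the check; unconfirmed (possibly hallucinated) claims are dropped.
        if claim.check(row[claim.column]):
            verified.append(claim)
    return verified


rows = [{"age": "34"}, {"age": "abc"}]
claims = [
    CriticClaim(0, "age", "not a number", is_not_numeric),  # unsupported claim
    CriticClaim(1, "age", "not a number", is_not_numeric),  # real error
]
verified = gate_claims(rows, claims)
print([c.row_index for c in verified])  # prints [1]
```

In a pipeline built this way, the generator would rewrite only the cells named in `verified`, so an unsupported critic claim cannot trigger an edit to clean data.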

SOURCE

https://arxiv.org/abs/2606.02866
