[r/singularity]
Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added
May 5, 2026
The LLM Debate Benchmark has expanded to include nine new models, evaluated across 683 adversarial, multi-turn motions and scored with Bradley-Terry (BT) ratings on an Elo-like scale centered at 1500. Claude Opus 4.7 holds the top position at 1711 BT. GPT-5.5 high enters at 1574, notably below GPT-5.4 high at 1625, and Grok 4.3 regresses sharply versus its predecessor, dropping from 1512 to 1419. GLM-5.1 shows genuine improvement, rising from 1536 to 1573. The three-model judge panel achieves 0.55 mean cross-judge winner agreement. Researchers benchmarking reasoning and argumentation quality, particularly for agentic or debate-style applications, should track this leaderboard as a complement to static QA evals.
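For readers unfamiliar with how Bradley-Terry ratings map onto an Elo-like scale, here is a minimal sketch. The model names and win counts below are made up for illustration and are not leaderboard data; the fitting method shown (the classic minorization-maximization update) is one standard way to estimate BT strengths, not necessarily what the benchmark authors use.

```python
import math

# Hypothetical pairwise win counts (illustrative only):
# wins[a][b] = number of debates model a won against model b.
wins = {
    "A": {"B": 30, "C": 40},
    "B": {"A": 20, "C": 28},
    "C": {"A": 10, "B": 22},
}
models = list(wins)

# Bradley-Terry fit via minorization-maximization:
# p_i <- W_i / sum_j (n_ij / (p_i + p_j)),
# where W_i is model i's total wins and n_ij the games between i and j.
p = {m: 1.0 for m in models}
for _ in range(200):
    new_p = {}
    for i in models:
        total_wins = sum(wins[i].values())
        denom = sum(
            (wins[i][j] + wins[j][i]) / (p[i] + p[j])
            for j in models
            if j != i
        )
        new_p[i] = total_wins / denom
    # Normalize so the geometric mean of strengths is 1; BT strengths
    # are only identified up to a common scale factor.
    g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
    p = {m: v / g for m, v in new_p.items()}

# Map strengths onto an Elo-like scale centered at 1500, using Elo's
# convention that a 400-point gap corresponds to 10:1 win odds.
ratings = {m: round(1500 + 400 * math.log10(p[m])) for m in models}
print(ratings)
```

With the normalization above, the ratings average to 1500 by construction, and a rating gap translates directly into a predicted win probability via the logistic curve, which is why BT scores on a leaderboard read like Elo numbers.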
ai