
GDPval-AA v2 Benchmark Validity Questioned Despite High Weighting

June 16, 2026

GDPval-AA v2 uses frontier models to judge other AI outputs on questions derived from a closed benchmark, and the resulting score is weighted heavily in the Intelligence Index v4.1. The methodology behind the human Elo baseline is unspecified, which makes the Elo score of 1818 for Claude Fable 5 difficult to interpret against real-world capability.
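To illustrate why the baseline matters, here is a minimal sketch using the standard Elo expected-score formula. It assumes the benchmark follows the conventional base-10, 400-point Elo scale, which the source does not confirm. The baseline ratings in the loop are hypothetical placeholders, not published figures; only the 1818 value comes from the summary above.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of A against B (base-10 logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


MODEL_ELO = 1818  # score quoted in the summary

# Hypothetical human baselines, chosen only to show the spread.
# The actual baseline and how it was derived are unspecified.
for baseline in (1000, 1500, 1800):
    p = expected_score(MODEL_ELO, baseline)
    print(f"baseline {baseline}: expected preference rate {p:.3f}")
```

Under these assumptions the same 1818 rating implies anything from a near coin flip to a near-certain preference over the human baseline, depending on where that baseline sits. An Elo number carries meaning only relative to the pool it was anchored to.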

HOW THIS AFFECTS YOU

● Researcher: AI-judged benchmarks with opaque human baselines introduce circular validity problems worth scrutinizing before citing leaderboard rankings.

● Founder: Worth watching because leaderboard rankings like these influence model selection decisions, but the methodology here is contested.

Read original: x.com
