● researcher: AI-judged benchmarks with opaque human baselines introduce circular validity problems worth scrutinizing before citing leaderboard rankings.
● founder: Worth watching because leaderboard rankings like these influence model selection decisions, but the methodology here is contested.