[arXiv] score: 0.44
The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
May 13, 2026
Demonstrates that simple averaging produces misleading benchmark rankings when evaluation matrices are sparse and items differ in difficulty; Item Response Theory recovers ground-truth rankings across NLP, clinical trial, autonomous vehicle, and cybersecurity domains. Includes a Spearman ρ degradation analysis.
cs.LG cs.AI