
The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

May 13, 2026

Demonstrates that simple averaging produces misleading benchmark rankings when the evaluation matrix is sparse and item difficulties differ widely, and that Item Response Theory recovers ground-truth rankings across NLP, clinical-trial, autonomous-vehicle, and cybersecurity domains. Includes an analysis of how Spearman ρ degrades.
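The failure mode in the summary can be illustrated with a small noiseless sketch. It is not the paper's method or data: the three models, their abilities, the five item difficulties, and the sparse evaluation design are all hypothetical, and a one-parameter (Rasch) model fitted by plain gradient ascent stands in for whatever IRT variant the paper uses. Each observed cell holds an expected score (a probability) rather than a 0/1 outcome, which keeps the example deterministic.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical ground truth (assumption, not taken from the paper).
TRUE_ABILITY = {"strong": 1.0, "mid": 0.0, "weak": -1.0}
TRUE_DIFFICULTY = [-3.0, -2.0, 0.0, 2.0, 3.0]  # item 0 easiest, item 4 hardest

# Sparse design: the weak model only sees easy items, the strong model only
# sees hard ones. Item 2 is shared by all three, so the design is connected.
OBSERVED = {"weak": [0, 1, 2], "mid": [1, 2, 3], "strong": [2, 3, 4]}

# Noiseless expected scores under the Rasch model: P(correct) = sigmoid(theta - b).
scores = {
    (m, j): sigmoid(TRUE_ABILITY[m] - TRUE_DIFFICULTY[j])
    for m, items in OBSERVED.items()
    for j in items
}


def simple_average(scores):
    """Mean score per model over the items that model was evaluated on."""
    totals, counts = {}, {}
    for (m, _), y in scores.items():
        totals[m] = totals.get(m, 0.0) + y
        counts[m] = counts.get(m, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}


def fit_rasch(scores, n_items, steps=5000, lr=0.5):
    """Joint maximum-likelihood fit of abilities and difficulties.

    The model is identified only up to a common shift of all parameters.
    The gradient never moves sum(theta) + sum(b), so starting from zero
    pins that sum at zero, which matches the hypothetical truth above.
    """
    theta = {m: 0.0 for (m, _) in scores}
    b = [0.0] * n_items
    for _ in range(steps):
        g_theta = {m: 0.0 for m in theta}
        g_b = [0.0] * n_items
        for (m, j), y in scores.items():
            resid = y - sigmoid(theta[m] - b[j])
            g_theta[m] += resid  # d logL / d theta_m
            g_b[j] -= resid      # d logL / d b_j
        for m in theta:
            theta[m] += lr * g_theta[m]
        for j in range(n_items):
            b[j] += lr * g_b[j]
    return theta, b


def spearman(a, b):
    """Spearman rho between two score dicts with the same keys (assumes no ties)."""
    keys = list(a)

    def ranks(d):
        order = sorted(keys, key=lambda k: -d[k])
        return {k: r for r, k in enumerate(order, start=1)}

    ra, rb = ranks(a), ranks(b)
    n = len(keys)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in keys)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


avg = simple_average(scores)
theta_hat, b_hat = fit_rasch(scores, len(TRUE_DIFFICULTY))
rho_avg = spearman(TRUE_ABILITY, avg)
rho_irt = spearman(TRUE_ABILITY, theta_hat)

print("simple average:", {m: round(v, 3) for m, v in avg.items()}, "rho =", rho_avg)
print("Rasch ability: ", {m: round(v, 3) for m, v in theta_hat.items()}, "rho =", rho_irt)
```

In this toy design the weak model has the highest average because it was scored only on easy items, so averaging inverts the true order (ρ = -1), while the fitted abilities restore it (ρ = 1). Real evaluation matrices are noisy and binary, so recovery there would be approximate rather than exact.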

cs.LG, cs.AI

SOURCE

https://arxiv.org/abs/2605.11205
