[arXiv] score: 0.15

284-Paper Audit Finds Human Eval Protocols Widely Under-Reported

June 9, 2026

A manual review of 284 NLP papers published in 2023–2025, plus an LLM-assisted analysis of more than 1,800 others, finds systematic under-reporting across 20 reproducibility criteria for human evaluation studies. The key gaps concern who provided the judgments, how the tasks were designed, and how the scores should be interpreted. These omissions undermine comparability across published results.

HOW THIS AFFECTS YOU

● researcher: Worth watching because benchmark comparisons that rely on human evaluation from recent *CL papers may be less reliable than assumed, and your own eval protocols likely have gaps against these 20 criteria.

