[arXiv] score: 0.41

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

May 13, 2026
A sample-efficient evaluation method for measuring whether LLMs reach five-nines (99.999%) reliability on saturated benchmarks. It addresses the estimation of rare failure probabilities, which is critical for reliability-sensitive deployments.
cs.LG