[arXiv] score: 0.41
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
May 13, 2026
A sample-efficient LLM evaluation method for measuring five-nines (99.999%) reliability on saturated benchmarks. It addresses the estimation of rare-failure probabilities, which is critical for reliability-sensitive deployments.
cs.LG