[r/MachineLearning]
We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/older models often win. Full dataset + framework open-sourced. [R]
April 23, 2026
**Summary:**
ArbitrHQ ran a systematic OCR benchmark across 18 LLMs using 42 standardized documents with 10 runs per model (7,560 total API calls), measuring pass^n reliability, cost-per-success, latency, and critical-field accuracy. The core finding is that smaller and older models match flagship-model accuracy on standard document extraction tasks at significantly lower cost, suggesting that teams defaulting to GPT-4o-class models for OCR workflows are likely overspending. The benchmark and a self-serve testing tool are open-sourced at github.com/ArbitrHq/ocr-mini-bench, making it directly applicable for teams auditing their own document-processing pipelines.
---
**Practitioner note:** The pass^n metric (probability of success across n repeated calls) is a useful framing for production reliability, though the 42-document test set is narrow — results may not generalize to domain-specific or degraded document types. Worth running their tool against your own corpus before switching models.
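Under an independence assumption, pass^n as described above reduces to p^n, where p is the empirical per-call success rate. A minimal sketch of how you might compute it (and cost-per-success) from your own run logs — function names are hypothetical, not from the ArbitrHQ tool:

```python
def pass_power_n(successes: int, runs: int, n: int) -> float:
    """Empirical pass^n: probability that n repeated calls ALL succeed,
    assuming independent calls with per-call success rate p = successes/runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    return p ** n


def cost_per_success(total_cost: float, successes: int) -> float:
    """Total spend divided by successful extractions; inf if nothing succeeded."""
    return total_cost / successes if successes else float("inf")


# Example: a model that passes 9 of 10 runs looks reliable per-call (p = 0.9),
# but the chance of 5 consecutive clean extractions is only ~0.59.
print(pass_power_n(9, 10, 5))
```

The sharp drop from p to p^n is why a 90%-accurate model can still be unusable in a pipeline that needs every call to succeed.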