NayanaOCR Corpus: 1M+ Doc Images Across 22 Languages Released
May 25, 2026
Open-source synthetic document corpus with 1M+ images spanning 22 languages, designed for multilingual, multimodal, multitask OCR training.
HOW THIS AFFECTS YOU
● Builder: You can use this dataset to train or fine-tune multilingual OCR and document processing pipelines without sourcing proprietary data.
● Researcher: This is the largest open-source synthetic multilingual document corpus available, useful for benchmarking and training OCR and document understanding models across 22 languages.