
No Single Model Reliably Assesses Quality Across 41K+ Translation Directions

June 2, 2026

A benchmark of four embedding models and nine reference-free quality estimation (QE) models on FLORES-200, covering up to 41,412 ordered language-pair directions, finds that no single model is reliable across all directions. Naive QE ensembles dilute strong signals rather than improve them. Target-language documentation coverage is the strongest predictor of QE score reliability.
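The 41,412 figure is consistent with counting ordered (source, target) pairs over 204 languages, since 204 × 203 = 41,412. The sketch below checks that arithmetic. The language count of 204 is an assumption based on the number of language varieties commonly cited for FLORES-200; the summary itself states only the number of directions. The language labels are placeholders, not real FLORES-200 codes.

```python
from itertools import permutations


def count_ordered_directions(n_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return n_languages * (n_languages - 1)


# Assumption: 204 language varieties, labeled with placeholder names.
langs = [f"lang_{i:03d}" for i in range(204)]

# Enumerate every ordered pair explicitly; (a, b) and (b, a) are distinct directions.
directions = list(permutations(langs, 2))

print(len(directions))                 # 41412
print(count_ordered_directions(204))   # 41412
```

Each direction is counted separately from its reverse, which matters here because a QE model can be reliable when translating into a language and unreliable when translating out of it.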

cs.CL

HOW THIS AFFECTS YOU

● builder: If you filter multilingual training data with a single QE model, these results suggest selecting a model per translation direction rather than using one model for every direction.

● researcher: The key takeaway for multilingual data pipeline design is to evaluate per direction rather than rely on aggregate scores.
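A minimal sketch of what direction-specific model selection could look like, compared with picking one model by its aggregate score. Everything here is hypothetical: the model names `qe_a` and `qe_b`, the three directions, and the reliability numbers are made up for illustration and are not results from the paper. "Reliability" stands in for whatever per-direction validation metric you have, for example correlation with human judgments on a held-out set.

```python
from statistics import mean

# Hypothetical per-direction reliability scores (higher is better).
# Keys are (source, target) pairs; values map model name -> score.
reliability = {
    ("eng", "fra"): {"qe_a": 0.62, "qe_b": 0.55},
    ("eng", "yor"): {"qe_a": 0.18, "qe_b": 0.41},
    ("yor", "eng"): {"qe_a": 0.35, "qe_b": 0.30},
}


def best_global_model(scores_by_direction):
    """Pick the single model with the highest mean score over all directions."""
    models = sorted({m for s in scores_by_direction.values() for m in s})
    return max(models, key=lambda m: mean(s[m] for s in scores_by_direction.values()))


def best_model_per_direction(scores_by_direction):
    """Pick, for each direction separately, the model with the highest score."""
    return {d: max(s, key=s.get) for d, s in scores_by_direction.items()}


global_choice = best_global_model(reliability)
per_direction = best_model_per_direction(reliability)

print(global_choice)
for direction, model in per_direction.items():
    print(direction, model)
```

In this toy data the aggregate winner is not the best choice for every direction, which is the situation the card describes. A real pipeline would also need a fallback for directions with no validation data.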

SOURCE

https://arxiv.org/abs/2606.00285
