[arXiv] score: 0.11
No Single Model Reliably Assesses Quality Across 41K+ Translation Directions
June 2, 2026
A benchmark of four embedding models and nine reference-free quality estimation (QE) models on FLORES-200, covering up to 41,412 ordered language-pair directions, finds that no model is universally reliable and that naive QE ensembles dilute strong signals rather than improving them. Target-language documentation coverage is the strongest predictor of QE score reliability.
cs.CL
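The figure of 41,412 ordered directions is consistent with every ordered pair among 204 language varieties, since n languages give n × (n − 1) source-to-target directions. The sketch below only illustrates that arithmetic; the value 204 and the three language codes are assumptions for illustration, not details taken from the summary.

```python
from itertools import permutations


def count_directions(n_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return n_languages * (n_languages - 1)


# Tiny illustrative language set (codes chosen for the example only).
langs = ["eng_Latn", "fra_Latn", "swh_Latn"]
directions = list(permutations(langs, 2))  # ordered pairs, no self-pairs

# The closed form agrees with explicit enumeration on the small set.
assert len(directions) == count_directions(len(langs))

# Assumption: 204 language varieties.
print(count_directions(204))  # 41412
```

Because directions are ordered, eng→fra and fra→eng count separately, which matters when QE reliability depends on the target language.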
HOW THIS AFFECTS YOU
● builder: If you are filtering multilingual training data with a single QE model, this suggests you need direction-specific model selection rather than a one-size-fits-all approach.
● researcher: Direction-aware evaluation rather than aggregate scoring is the key takeaway for multilingual data pipeline design.
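A minimal sketch of what direction-specific model selection could look like, assuming you already have a per-direction reliability score for each QE model (for example, agreement with a trusted signal on a held-out set). The model names, scores, and function names are hypothetical, and this is not the paper's procedure.

```python
from statistics import mean

# Hypothetical reliability scores: direction -> {QE model name -> score}.
reliability = {
    ("eng_Latn", "fra_Latn"): {"qe_a": 0.80, "qe_b": 0.60},
    ("eng_Latn", "swh_Latn"): {"qe_a": 0.30, "qe_b": 0.55},
}


def select_per_direction(scores_by_direction):
    """Pick the most reliable model separately for each direction."""
    return {
        direction: max(scores, key=scores.get)
        for direction, scores in scores_by_direction.items()
    }


def select_global(scores_by_direction):
    """Pick one model by its mean reliability across all directions."""
    models = set()
    for scores in scores_by_direction.values():
        models.update(scores)
    return max(
        models,
        key=lambda m: mean(
            s[m] for s in scores_by_direction.values() if m in s
        ),
    )


per_direction = select_per_direction(reliability)
global_choice = select_global(reliability)

for direction, model in per_direction.items():
    print(direction, "->", model)
print("single global model:", global_choice)
```

In this toy data the global choice is qe_b, which is the weaker model for eng→fra. An aggregate score hides that kind of per-direction mismatch.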