[HUGGINGFACE] score: 0.42
No Single Model Reliably Assesses Quality Across 41K+ Multilingual Translation Directions
May 28, 2026
A benchmark of four embedding models on FLORES-200 and BOUQuET across 6,654 language-pair directions, and of nine reference-free quality estimation (QE) models across 41,412 directions, finds no universally reliable model. Naive QE ensembles dilute strong signals. The results have direct implications for multilingual data pipeline design.
HOW THIS AFFECTS YOU
● builder: If you're filtering multilingual training data with a single QE model, this research suggests you're likely introducing systematic errors in specific language directions.
● researcher: The large-scale cross-direction evaluation exposes reliability gaps in current multilingual QE and embedding models that warrant targeted model development.
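
To see how a naive ensemble can dilute a strong signal, here is a minimal sketch for a single language direction. It assumes "naive" means an unweighted mean of estimator scores, which the summary does not specify. Every number is an invented toy value, not a result from the paper, and the names `pearson` and `mean_ensemble` are hypothetical helpers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_ensemble(*score_lists):
    """Unweighted per-segment average of several estimators' scores."""
    return [sum(scores) / len(scores) for scores in zip(*score_lists)]


# Toy data for ONE hypothetical language direction (illustrative only).
human = [1, 2, 3, 4, 5]                    # assumed human quality ratings
strong = [0.10, 0.25, 0.30, 0.45, 0.50]    # estimator that tracks the ratings
weak_a = [0.90, 0.10, 0.80, 0.20, 0.70]    # estimator that is mostly noise here
weak_b = [0.20, 0.90, 0.10, 0.80, 0.30]    # another noisy estimator

ensemble = mean_ensemble(strong, weak_a, weak_b)

r_strong = pearson(strong, human)
r_ensemble = pearson(ensemble, human)

print(f"strong estimator alone: r = {r_strong:.3f}")
print(f"naive mean ensemble:    r = {r_ensemble:.3f}")
```

On this toy data the averaged score correlates less well with the human ratings than the strong estimator does alone. In a real pipeline, one option is to run this comparison per direction on a held-out labeled sample before deciding whether to use an ensemble or a single model for that direction.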