PRPE Tokenizer Hits 83% Morphological Accuracy on Quechua vs BPE's 6.7%
June 24, 2026
QuechuaTok benchmarks BPE, Unigram LM, WordPiece, and the morphology-aware PRPE tokenizer on Southern Quechua using a 200k-sentence corpus. BPE achieves the lowest fertility rate (1.636 tokens per word) but only 6.67% morphological boundary accuracy, while PRPE reaches 83.33%. The results suggest that fertility rate on its own is an unreliable proxy for tokenizer quality in agglutinative languages.
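To make the two metrics concrete, here is a minimal Python sketch. It assumes fertility is the average number of tokens per word and treats boundary accuracy as the share of gold morpheme boundaries that a tokenizer reproduces; the benchmark's exact definition is not given in this summary, so this is one plausible reading. The function names and the toy segmentations of "wasikunapi" (wasi-kuna-pi, "in the houses") are illustrative, not actual BPE or PRPE output.

```python
from typing import List, Set


def boundaries(segments: List[str]) -> Set[int]:
    """Return the character offsets of the internal cuts in a segmented word."""
    offsets: Set[int] = set()
    pos = 0
    for seg in segments[:-1]:  # the end of the last segment is not an internal cut
        pos += len(seg)
        offsets.add(pos)
    return offsets


def fertility(tokenized_words: List[List[str]]) -> float:
    """Average number of tokens produced per word."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


def boundary_accuracy(predicted: List[List[str]], gold: List[List[str]]) -> float:
    """Fraction of gold morpheme boundaries that also appear in the prediction.

    Assumption: this is boundary recall, one possible definition of the metric.
    """
    hits = total = 0
    for pred_word, gold_word in zip(predicted, gold):
        gold_cuts = boundaries(gold_word)
        hits += len(gold_cuts & boundaries(pred_word))
        total += len(gold_cuts)
    return hits / total if total else 0.0


# Gold morphological segmentation: wasi (house) + kuna (plural) + pi (locative).
gold = [["wasi", "kuna", "pi"]]

# Hypothetical tokenizer outputs, made up for illustration only.
frequency_style = [["wasiku", "napi"]]        # fewer tokens, cuts inside morphemes
morphology_style = [["wasi", "kuna", "pi"]]   # more tokens, cuts on morpheme edges

print(fertility(frequency_style), boundary_accuracy(frequency_style, gold))    # 2.0 0.0
print(fertility(morphology_style), boundary_accuracy(morphology_style, gold))  # 3.0 1.0
```

In this toy case the segmentation with lower fertility misses every morpheme boundary, which is the same pattern the benchmark reports at scale.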
HOW THIS AFFECTS YOU
● Researcher: Morphological boundary accuracy should replace or supplement fertility rate as the primary tokenizer evaluation metric when working on agglutinative or low-resource languages.