Optimal Tokenizer Algorithm Found, But Existing BPE Already Within 1%
June 11, 2026
A new algorithm computes provably optimal tokenizers using cutting-plane techniques analogous to those in TSP solvers, but it finds that current BPE tokenizers are already within 1% of optimal on training data. After accounting for generalization to held-out data and vocabulary size adjustments, the practical gain is negligible.
HOW THIS AFFECTS YOU
● Researcher: Confirms that tokenizer optimality is practically solvable despite theoretical intractability, and establishes a ceiling for how much tokenizer improvements can contribute to model efficiency.