Optimal Tokenizer Algorithm Found, But Existing BPE Already Within 1%

June 11, 2026

A new algorithm computes provably optimal tokenizers using cutting-plane techniques analogous to those used in traveling salesman problem (TSP) solvers, but it finds that current BPE tokenizers are already within 1% of optimal on training data. Once generalization to held-out data and vocabulary size adjustments are taken into account, the practical gain is negligible.
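To make the "within 1% of optimal" figure concrete, here is a minimal sketch of how such a gap could be computed. It assumes the objective is total token count on a corpus, which the summary does not state; the function name `relative_gap` and the counts are hypothetical and chosen only to illustrate the arithmetic, not taken from the original post.

```python
def relative_gap(candidate_tokens: int, optimal_tokens: int) -> float:
    """Fraction by which a tokenizer's token count exceeds the optimal count.

    Assumption: "optimal" means the fewest tokens on the same corpus.
    """
    if optimal_tokens <= 0:
        raise ValueError("optimal_tokens must be positive")
    return (candidate_tokens - optimal_tokens) / optimal_tokens


# Hypothetical token counts for the same corpus (not measured results).
bpe_tokens = 1_008_000
optimal_tokens = 1_000_000

gap = relative_gap(bpe_tokens, optimal_tokens)
print(f"{gap:.2%}")  # prints "0.80%"
```

Under this reading, a gap below 0.01 corresponds to the "within 1%" claim.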

HOW THIS AFFECTS YOU

● Researcher: Confirms that tokenizer optimality is practically solvable despite theoretical intractability, and establishes a ceiling for how much tokenizer improvements can contribute to model efficiency.

Read original: blog.aqnichol.com
