Tapered LLM Architecture Improves Perplexity by Front-Loading Parameter Capacity
June 23, 2026
Allocating more parameters to earlier transformer layers and fewer to later ones improves perplexity under a fixed parameter budget, while the reverse allocation hurts it. The Tapered Language Models (TLMs) principle applies a monotonic reduction in capacity across depth to MLPs and other parameter-bearing components. This challenges the default uniform-width architecture inherited from the original transformer.
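To make the fixed-budget idea concrete, here is a minimal sketch of one way to taper MLP hidden widths across depth while keeping the total MLP parameter count equal to a uniform baseline. The linear schedule, the `ratio` parameter, and the function names `tapered_widths` and `mlp_params` are assumptions made for illustration; the summary above does not specify the actual schedule used for TLMs.

```python
def tapered_widths(n_layers, uniform_width, ratio):
    """Return per-layer MLP hidden widths that shrink linearly with depth.

    Hypothetical helper: `ratio` is the assumed first-layer / last-layer
    width ratio (ratio > 1 front-loads capacity). The widths sum to
    n_layers * uniform_width, so the MLP budget matches the uniform baseline.
    """
    if n_layers == 1:
        return [uniform_width]
    # Unnormalised linear ramp from `ratio` (first layer) down to 1 (last layer).
    raw = [ratio - (ratio - 1) * i / (n_layers - 1) for i in range(n_layers)]
    # Rescale so the widths sum to the uniform baseline's total width.
    scale = uniform_width * n_layers / sum(raw)
    widths = [round(w * scale) for w in raw]
    # Absorb any rounding drift in the first (widest) layer.
    widths[0] += uniform_width * n_layers - sum(widths)
    return widths


def mlp_params(d_model, widths):
    """Count weights in two-matrix MLPs (up and down projections), ignoring biases."""
    return sum(2 * d_model * w for w in widths)


if __name__ == "__main__":
    tapered = tapered_widths(n_layers=12, uniform_width=4096, ratio=2.0)
    uniform = [4096] * 12
    print(tapered)
    print(mlp_params(1024, tapered) == mlp_params(1024, uniform))
```

Because a two-matrix MLP's parameter count is linear in its hidden width for a fixed model dimension, matching the sum of the widths is enough to match the MLP budget. Passing a `ratio` below 1 gives the reverse (back-loaded) allocation that the summary reports as harmful.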
HOW THIS AFFECTS YOU
● Researcher: TLMs offer a drop-in architectural modification testable under fixed parameter budgets, with controlled experiments showing perplexity gains over uniform-width baselines.