[arXiv]score: 0.15
TOPD Fixes On-Policy Distillation by Targeting Real Reasoning Forks
June 2, 2026
Standard on-policy distillation misidentifies roughly 30% of high-loss tokens as reasoning errors when they are surface-form mismatches, and fails to correct genuine divergences because reasoning failures unfold as short-horizon drift rather than isolated token errors. TOPD uses near-future trajectory context to distinguish real divergent states and apply distributional correction at the trajectory level rather than the token level.
cs.CLcs.AI
HOW THIS AFFECTS YOU
●
researcherTOPD offers a concrete fix to a measurable failure mode in OPD training pipelines, with the 30% misidentification rate providing a diagnostic benchmark for evaluating your own distillation setup.