[arXiv] score: 0.15

Trajectory-Level Distillation Fix Outperforms Token-Level Loss Reweighting

June 9, 2026

On-policy distillation suffers from prefix failure, where token-level supervision creates bimodal teacher mixtures and fragmented gradients that token-loss reweighting cannot fix. TRD corrects student rollouts at the trajectory level before distillation, addressing the root cause rather than symptoms. The method also improves exploration by exposing students to alternative valid derivations.
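The contrast between token-level and trajectory-level supervision can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `toy_teacher`, `correct_rollout`, the token names, and the probabilities are hypothetical and are not taken from the paper, whose actual correction procedure is not described in this summary. The sketch only shows why teacher targets gathered on a flawed student prefix can be split between two modes, while targets gathered after the rollout is corrected are not.

```python
import math

def entropy(dist):
    """Shannon entropy (nats) of a dict mapping token -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def toy_teacher(prefix):
    """Hypothetical teacher: confident after a valid prefix, split after a flawed one."""
    if "bad" in prefix:
        # The teacher hedges between repairing the error and continuing from it.
        return {"fix": 0.5, "go": 0.5}
    return {"go": 0.95, "fix": 0.05}

def token_level_targets(rollout, teacher):
    """Token-level supervision: query the teacher on the student's own prefixes."""
    return [teacher(tuple(rollout[:t])) for t in range(len(rollout))]

def correct_rollout(rollout):
    """Hypothetical trajectory-level correction: replace flawed steps before distilling."""
    return ["ok" if tok == "bad" else tok for tok in rollout]

def trajectory_level_targets(rollout, teacher):
    """Correct the whole rollout first, then collect teacher targets on it."""
    return token_level_targets(correct_rollout(rollout), teacher)

student_rollout = ["ok", "bad", "go", "go"]
token_targets = token_level_targets(student_rollout, toy_teacher)
traj_targets = trajectory_level_targets(student_rollout, toy_teacher)

# Average uncertainty of the teacher targets under each scheme.
mean_entropy_token = sum(entropy(d) for d in token_targets) / len(token_targets)
mean_entropy_traj = sum(entropy(d) for d in traj_targets) / len(traj_targets)
```

In this toy setup, every position after the flawed step receives a 50/50 teacher target under token-level supervision, so reweighting those token losses changes their scale but not their split. Correcting the rollout first removes the flawed prefix, and the teacher targets stay peaked at every position.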

HOW THIS AFFECTS YOU

● Researcher: TRD offers a principled trajectory-level alternative to token-loss truncation for on-policy distillation, with a concrete diagnosis of prefix failure as the structural cause.

