[arXiv] score: 0.12
Prune-OPD Cuts Wasted Compute in Long-Horizon On-Policy Distillation
May 29, 2026
On-policy distillation degrades when student trajectories drift from the teacher's reasoning path, wasting compute on unreliable reward signals. Prune-OPD monitors top-k token overlap between student and teacher in real time, down-weighting rewards and truncating rollouts when drift is detected, improving both training efficiency and reward quality on long-horizon tasks.
cs.LG, cs.AI
HOW THIS AFFECTS YOU
● researcher: Directly addresses a known failure mode in OPD scaling. The drift-detection and dynamic truncation mechanism is a concrete, implementable improvement for long-chain reasoning distillation pipelines.
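To make the mechanism in the summary concrete, here is a minimal sketch of overlap-based reward weighting and truncation. It is not the paper's implementation: the summary only says that top-k overlap is monitored, rewards are down-weighted, and rollouts are truncated on drift. The function names (`top_k_ids`, `prune_rollout`), the linear weighting rule, and the `threshold` and `patience` parameters are all assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple


def top_k_ids(scores: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring tokens at one step."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def prune_rollout(
    student_scores: List[Sequence[float]],
    teacher_scores: List[Sequence[float]],
    rewards: List[float],
    k: int = 2,
    threshold: float = 0.5,  # hypothetical drift threshold on overlap
    patience: int = 2,       # hypothetical: low-overlap steps tolerated in a row
) -> Tuple[List[float], int]:
    """Weight each step's reward by top-k overlap; stop on sustained drift.

    Returns the weighted rewards for the steps that were kept and the
    number of steps kept. The weighting rule (reward * overlap) is an
    assumption, not taken from the paper.
    """
    weighted: List[float] = []
    low_streak = 0
    for s, t, r in zip(student_scores, teacher_scores, rewards):
        # Fraction of the teacher's top-k tokens that the student also ranks in its top-k.
        overlap = len(top_k_ids(s, k) & top_k_ids(t, k)) / k
        weighted.append(r * overlap)
        # Count consecutive low-overlap steps; reset when overlap recovers.
        low_streak = low_streak + 1 if overlap < threshold else 0
        if low_streak >= patience:
            break  # truncate: later steps are never scored
    return weighted, len(weighted)


# Toy example over a 4-token vocabulary: the student drifts from step 2 onward.
teacher = [[0.9, 0.8, 0.1, 0.0]] * 5
student = [
    [0.9, 0.8, 0.1, 0.0],  # full overlap
    [0.9, 0.1, 0.8, 0.0],  # half overlap
    [0.0, 0.1, 0.9, 0.8],  # no overlap
    [0.0, 0.1, 0.9, 0.8],  # no overlap, second in a row: truncate here
    [0.9, 0.8, 0.1, 0.0],  # never reached
]
weighted, kept = prune_rollout(student, teacher, [1.0] * 5)
print(weighted, kept)  # [1.0, 0.5, 0.0, 0.0] 4
```

In this toy run the fifth step is never scored, which is where the compute saving would come from in a real rollout. A production version would work on batched logits and would need the thresholds tuned empirically.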