[HUGGINGFACE]score: 0.42
Trust-Region Behavior Blending Stabilizes Early On-Policy Distillation Training
May 28, 2026
TRB addresses weak early rollouts in on-policy distillation by substituting the rollout policy with the closest-to-teacher behavior within a KL trust region, then annealing the budget to zero to return to pure student rollouts. Across two math-reasoning distillation settings, TRB achieves the best average performance among compared methods.
paper
HOW THIS AFFECTS YOU
●
researcherTRB is a drop-in warmup method for on-policy distillation that addresses a known instability without changing the per-prefix training objective.