[HUGGINGFACE] score: 0.42

Trust-Region Behavior Blending Stabilizes Early On-Policy Distillation Training

May 28, 2026

Trust-Region Behavior Blending (TRB) addresses weak early rollouts in on-policy distillation: it replaces the student rollout policy with the behavior policy closest to the teacher that still lies within a KL trust region, then anneals the KL budget to zero so that training returns to pure student rollouts. Across two math-reasoning distillation settings, TRB achieves the best average performance among the compared methods.
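To make the idea concrete, here is a minimal sketch of one way a trust-region blend could be computed for a single next-token distribution. It is not taken from the paper. It assumes the blend is a normalized geometric interpolation between student and teacher, that the constraint is KL(blend || student) <= budget, and that the budget decays linearly. The function names, the KL direction, and the schedule are all hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def geometric_blend(student, teacher, lam):
    """Normalized student^(1-lam) * teacher^lam (assumes strictly positive probabilities)."""
    weights = [s ** (1 - lam) * t ** lam for s, t in zip(student, teacher)]
    total = sum(weights)
    return [w / total for w in weights]


def trust_region_behavior(student, teacher, budget, iters=50):
    """Blend nearest the teacher subject to KL(blend || student) <= budget.

    Assumption: the KL direction and the geometric blend family are
    illustrative choices, not the paper's stated formulation.
    """
    if budget <= 0:
        return list(student)  # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return list(teacher)  # teacher already lies inside the trust region
    lo, hi = 0.0, 1.0
    # KL(blend || student) grows with lam along this path, so bisection applies.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl(geometric_blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)


def linear_budget(step, warmup_steps, initial_budget):
    """Hypothetical schedule: anneal the KL budget linearly to zero."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)


if __name__ == "__main__":
    student = [0.7, 0.2, 0.1]
    teacher = [0.1, 0.2, 0.7]
    for step in (0, 50, 100):
        eps = linear_budget(step, warmup_steps=100, initial_budget=0.2)
        print(step, eps, trust_region_behavior(student, teacher, eps))
```

In this sketch, the rollout distribution moves from a teacher-leaning blend toward the student as the budget shrinks, and equals the student exactly once the budget reaches zero.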

paper

HOW THIS AFFECTS YOU

● researcher: TRB is a drop-in warmup method for on-policy distillation that addresses a known instability without changing the per-prefix training objective.

SOURCE

https://huggingface.co/papers/2605.31159
