[HUGGINGFACE] score: 0.48
FiRe-OPD Jointly Filters Trajectories and Soft-Reweights Tokens in LLM On-Policy Distillation
May 31, 2026
FiRe-OPD improves on-policy distillation in two stages: it first removes low-quality rollout trajectories, then applies soft token-level reweighting within the retained trajectories to emphasize informative tokens rather than discarding the rest outright, which avoids the information loss of hard token masking. The method targets the gap in optimization granularity between full-trace KL supervision, applied at every token, and selective training approaches that train only on a chosen subset of tokens.
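A minimal Python sketch of the filter-then-reweight idea, built on assumptions the summary does not specify: each rollout carries a scalar quality score (higher is better), each token carries the student's divergence from the teacher (for example a per-token KL), and "informative" tokens are taken to be the high-divergence ones. The function names, the top-fraction trajectory filter, and the softmax token weighting are hypothetical stand-ins chosen for illustration, not FiRe-OPD's actual criteria.

```python
import math
from typing import List, Sequence

# ASSUMPTIONS (hypothetical, not taken from the FiRe-OPD paper):
#   traj_scores[i]   - scalar quality score for rollout i, higher is better
#   token_div[i][t]  - student-vs-teacher divergence (e.g. per-token KL) at token t
#   "informative" tokens are assumed to be the high-divergence ones


def filter_trajectories(traj_scores: Sequence[float], keep_fraction: float) -> List[int]:
    """Stage 1: keep indices of the top `keep_fraction` of rollouts by score."""
    n_keep = max(1, math.ceil(len(traj_scores) * keep_fraction))
    ranked = sorted(range(len(traj_scores)), key=lambda i: traj_scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def soft_token_weights(token_div: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Stage 2: softmax over divergences, rescaled so the weights average to 1.

    Every token keeps a strictly positive weight, unlike a hard 0/1 mask.
    """
    if not token_div:
        return []
    peak = max(token_div)  # subtract the max for numerical stability
    exps = [math.exp((d - peak) / temperature) for d in token_div]
    z = sum(exps)
    return [len(token_div) * e / z for e in exps]


def hard_token_mask(token_div: Sequence[float], top_k: int) -> List[float]:
    """Hard-selection baseline for contrast: 1 for the top_k tokens, 0 elsewhere."""
    ranked = sorted(range(len(token_div)), key=lambda t: token_div[t], reverse=True)
    keep = set(ranked[:top_k])
    return [1.0 if t in keep else 0.0 for t in range(len(token_div))]


def filter_then_reweight_loss(
    token_div: Sequence[Sequence[float]],
    traj_scores: Sequence[float],
    keep_fraction: float = 0.5,
    temperature: float = 1.0,
) -> float:
    """Weighted mean per-token divergence over the retained rollouts only."""
    total, count = 0.0, 0
    for i in filter_trajectories(traj_scores, keep_fraction):
        divs = token_div[i]
        weights = soft_token_weights(divs, temperature)
        total += sum(w * d for w, d in zip(weights, divs))
        count += len(divs)
    return total / count if count else 0.0


if __name__ == "__main__":
    divs = [[0.1, 2.0, 0.5], [0.3, 0.3, 0.3], [1.5, 0.2, 0.4]]
    scores = [0.9, 0.1, 0.7]  # rollout 1 is treated as low quality and dropped
    print(filter_trajectories(scores, keep_fraction=0.5))  # [0, 2]
    print(soft_token_weights(divs[0]))
    print(hard_token_mask(divs[0], top_k=1))  # [0.0, 1.0, 0.0]
    print(filter_then_reweight_loss(divs, scores))
```

In this sketch, `temperature` sets where the soft weights sit between the two extremes the summary contrasts: a large temperature pushes every weight toward 1 (uniform, full-trace supervision), while a small one concentrates weight on the highest-divergence tokens, approaching hard selection without ever zeroing a token. In an actual trainer the weights would typically be computed from detached values so that they act as constants in the gradient.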
HOW THIS AFFECTS YOU
● Researcher: The two-stage filter-then-reweight formulation is a concrete improvement over hard token selection in on-policy distillation, and its benchmark numbers are worth checking against your own KL-based baselines.