[HUGGINGFACE] score: 0.48

FiRe-OPD Jointly Filters Trajectories and Soft-Reweights Tokens in LLM On-Policy Distillation

May 31, 2026

FiRe-OPD improves on-policy distillation in two stages: it first removes low-quality rollout trajectories, then applies soft token-level reweighting within the retained trajectories to emphasize informative tokens, which avoids the information loss of hard token masking. The method targets the gap in optimization granularity between full-trace KL supervision and selective training approaches.
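
A minimal sketch of the filter-then-reweight idea, for illustration only. The summary does not state the paper's actual filtering criterion or weighting function, so everything concrete here is an assumption: the per-trajectory `score` field, the `min_score` threshold, the softmax-over-token-KL weighting, and all function names are hypothetical and not taken from the paper.

```python
import math

def filter_trajectories(trajs, min_score):
    # Stage 1 (assumed criterion): keep rollouts whose quality score
    # meets a threshold; the rest are dropped entirely.
    return [t for t in trajs if t["score"] >= min_score]

def soft_token_weights(token_kl, temperature=1.0):
    # Stage 2 (assumed form): softmax over per-token student/teacher KL,
    # rescaled so the weights average to 1. Every token keeps a nonzero
    # weight, unlike a hard mask that zeroes out unselected tokens.
    m = max(token_kl)
    exps = [math.exp((k - m) / temperature) for k in token_kl]
    total = sum(exps)
    n = len(token_kl)
    return [n * e / total for e in exps]

def filtered_reweighted_loss(trajs, min_score=0.5, temperature=1.0):
    # Weighted mean of per-token KL inside each kept trajectory,
    # averaged over the kept trajectories.
    kept = filter_trajectories(trajs, min_score)
    if not kept:
        return 0.0
    per_traj = []
    for t in kept:
        w = soft_token_weights(t["token_kl"], temperature)
        weighted = sum(wi * ki for wi, ki in zip(w, t["token_kl"]))
        per_traj.append(weighted / len(w))
    return sum(per_traj) / len(per_traj)

# Toy rollouts with made-up scores and per-token KL values.
rollouts = [
    {"score": 0.9, "token_kl": [0.1, 0.1, 2.0]},
    {"score": 0.2, "token_kl": [3.0, 3.0, 3.0]},
]
kept = filter_trajectories(rollouts, 0.5)
weights = soft_token_weights(kept[0]["token_kl"])
loss = filtered_reweighted_loss(rollouts)
```

In this toy setup the second rollout is discarded in stage 1, and in the first rollout the high-KL token receives the largest weight while the other two remain in the loss with smaller, nonzero weights.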

paper

HOW THIS AFFECTS YOU

Researcher: The two-stage filter-then-reweight formulation is a concrete improvement over hard token selection in on-policy distillation; the benchmark numbers are worth checking against your own KL-based baselines.

SOURCE

https://huggingface.co/papers/2606.02684
