ReNIO Upweights Incorrect LLM Rollouts to Improve On-Policy Distillation
June 21, 2026
ReNIO finds that training only on incorrect student-generated outputs outperforms training only on correct ones during on-policy distillation, because incorrect trajectories preserve exploratory reasoning near the model's capability boundary. The method reweights negative trajectories without requiring full answer-containing rollouts.
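To make the reweighting idea concrete, here is a minimal sketch of a weighted distillation loss in which incorrect rollouts count for more than correct ones. It is an illustration only, not ReNIO's actual weighting rule: the function name `reweighted_distillation_loss`, the parameters `negative_weight` and `positive_weight`, and the default values are all hypothetical. It also assumes each rollout already has a scalar loss and a correctness label, and does not model how the method handles rollouts that lack a final answer.

```python
from typing import Sequence


def reweighted_distillation_loss(
    per_rollout_losses: Sequence[float],
    is_correct: Sequence[bool],
    negative_weight: float = 2.0,  # hypothetical default, not from the source
    positive_weight: float = 1.0,  # hypothetical default, not from the source
) -> float:
    """Return the weighted mean of per-rollout distillation losses.

    Incorrect rollouts receive `negative_weight`; correct ones receive
    `positive_weight`. This is an assumed scheme for illustration.
    """
    if len(per_rollout_losses) != len(is_correct):
        raise ValueError("losses and correctness labels must have equal length")

    # Pick one weight per rollout based on its correctness label.
    weights = [positive_weight if ok else negative_weight for ok in is_correct]
    total_weight = sum(weights)
    if total_weight == 0:
        # No rollout contributes (e.g. empty batch or all weights zero).
        return 0.0

    weighted_sum = sum(w * loss for w, loss in zip(weights, per_rollout_losses))
    return weighted_sum / total_weight
```

In this sketch, setting `positive_weight=0.0` gives the incorrect-only regime described above, and setting `negative_weight=0.0` gives the correct-only regime.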
HOW THIS AFFECTS YOU
● Researcher: The asymmetry finding, that incorrect-only training beats correct-only training, is a concrete, actionable insight for anyone designing LLM reasoning distillation pipelines.