[HUGGINGFACE] score: 0.48

ReNIO Upweights Incorrect LLM Rollouts to Improve On-Policy Distillation

June 21, 2026

ReNIO finds that training only on incorrect student-generated outputs outperforms training only on correct ones during on-policy distillation, because incorrect trajectories preserve exploratory reasoning near the model's capability boundary. The method reweights negative trajectories without requiring full answer-containing rollouts.
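As a rough illustration of what reweighting negative trajectories could look like, the sketch below computes a weighted mean over per-rollout distillation losses, giving incorrect rollouts a larger weight. This is not ReNIO's actual objective, which the summary does not specify. The function name, the `negative_weight` and `positive_weight` parameters, and the assumption that a per-rollout correctness label is available are all hypothetical, and the sketch does not model how the method avoids full answer-containing rollouts.

```python
from typing import Sequence


def reweighted_distillation_loss(
    rollout_losses: Sequence[float],
    is_correct: Sequence[bool],
    negative_weight: float = 2.0,  # hypothetical weight for incorrect rollouts
    positive_weight: float = 1.0,  # hypothetical weight for correct rollouts
) -> float:
    """Weighted mean of per-rollout losses (illustrative only).

    rollout_losses: one scalar distillation loss per student rollout,
        e.g. a student-vs-teacher divergence averaged over tokens.
    is_correct: whether each rollout reached a correct answer.
    """
    if len(rollout_losses) != len(is_correct):
        raise ValueError("rollout_losses and is_correct must have equal length")
    if not rollout_losses:
        raise ValueError("at least one rollout is required")

    # Pick a weight per rollout based on its correctness label.
    weights = [positive_weight if ok else negative_weight for ok in is_correct]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    # Normalize by the total weight so the loss scale stays comparable
    # across batches with different correct/incorrect mixes.
    weighted_sum = sum(w * loss for w, loss in zip(weights, rollout_losses))
    return weighted_sum / total_weight
```

Under these assumptions, setting `positive_weight=0.0` reproduces the incorrect-only setting described above, and setting `negative_weight=0.0` reproduces the correct-only setting.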

HOW THIS AFFECTS YOU

● Researcher: The asymmetry finding, that incorrect-only training beats correct-only training, is a concrete, actionable insight for anyone designing LLM reasoning distillation pipelines.
