Step-by-Step Critiques Outperform Binary Rewards in Self-Distillation
June 8, 2026
When training a language model via self-distillation using a frozen critic, providing step-by-step critiques aligned to the solver's reasoning trace outperforms both binary reward signals (GRPO) and reference solutions as context for the self-teacher. The finding highlights that context design — not just the distillation method — is a key lever for improving model performance without inference-time context.
HOW THIS AFFECTS YOU
●
researcherDirectly informs training recipe choices for self-distillation pipelines, suggesting trace-aligned critiques as a superior feedback signal over scalar rewards.