Feedback Distillation Beats GRPO on Lean4 Theorem Proving Pass@k
May 28, 2026
Feedback Distillation trains models to match their own output distribution conditioned on privileged LLM-generated feedback at the token level, avoiding GRPO's sparse reward and mode collapse issues. On Lean4 theorem proving, it achieves higher policy entropy and better pass@k scaling than GRPO while allowing injection of external knowledge.
HOW THIS AFFECTS YOU
●
researcherToken-level supervision with privileged feedback is a concrete alternative to GRPO for formal reasoning tasks — the pass@k and entropy comparisons are directly reproducible.