[r/MachineLearning]score: 0.13
On-Policy Distillation Explained: Key Post-Training Behind Qwen 3, DeepSeek-V4, GLM-5.1
June 4, 2026
On-policy distillation (OPD) generates training data by sampling from the student model rather than the teacher, and is the core post-training technique behind Qwen 3.6, Qwen 3.7, GLM-5.1, and DeepSeek-V4. Sasha Rush's whiteboard explanation with Dwarkesh provides a practitioner-accessible breakdown of the method.
research
HOW THIS AFFECTS YOU
●
builderUnderstanding OPD is directly relevant if you're fine-tuning or distilling from frontier models, as it explains why on-policy data generation outperforms static teacher outputs.
●
researcherOPD's role across multiple frontier models makes it a high-priority technique to understand for anyone working on post-training alignment or distillation pipelines.