[r/MachineLearning] score: 0.13

On-Policy Distillation Explained: The Key Post-Training Technique Behind Qwen 3, DeepSeek-V4, GLM-5.1

June 4, 2026

On-policy distillation (OPD) generates training data by sampling from the student model rather than the teacher, with the teacher then scoring those samples. It is the core post-training technique behind Qwen 3.6, Qwen 3.7, GLM-5.1, and DeepSeek-V4. Sasha Rush's whiteboard explanation with Dwarkesh provides a practitioner-accessible breakdown of the method.
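To make the "sample from the student" idea concrete, here is a minimal toy sketch, not the recipe used by any of the models named above. It assumes tiny tabular bigram "models" (a table of next-token logits per previous token) and a per-token reverse KL loss, KL(student || teacher), evaluated only on contexts the student itself visited. All names (`opd_step`, `sample_from_student`, the learning rate, the vocabulary size) are hypothetical choices made for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_from_student(student, start, length, rng):
    """Roll out a sequence from the student's own bigram distribution."""
    seq = [start]
    for _ in range(length):
        probs = softmax(student[seq[-1]])
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq

def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) for one next-token distribution."""
    return sum(ps * math.log(ps / pt) for ps, pt in zip(p_student, p_teacher))

def opd_step(student, teacher, start, length, lr, rng):
    """One on-policy update: the student samples, the teacher grades each step."""
    seq = sample_from_student(student, start, length, rng)
    total = 0.0
    for prev in seq[:-1]:  # only contexts the student actually visited
        ps = softmax(student[prev])
        pt = softmax(teacher[prev])
        kl = reverse_kl(ps, pt)
        total += kl
        # Exact gradient of KL(ps || pt) with respect to the student logits.
        # The sampled contexts are treated as fixed (no gradient through sampling).
        for k in range(len(ps)):
            student[prev][k] -= lr * ps[k] * (math.log(ps[k] / pt[k]) - kl)
    return total / length

# Toy setup (assumed): 4-token vocabulary, random teacher, uniform student.
rng = random.Random(0)
V = 4
teacher = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(V)]
student = [[0.0] * V for _ in range(V)]
losses = [opd_step(student, teacher, start=0, length=8, lr=0.5, rng=rng)
          for _ in range(300)]
print("mean KL, first 20 steps:", sum(losses[:20]) / 20)
print("mean KL, last 20 steps: ", sum(losses[-20:]) / 20)
```

The contrast with off-policy distillation is in `sample_from_student`: a static pipeline would instead draw `seq` from the teacher once and reuse it, so the student would never be corrected on the contexts its own mistakes lead it into.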

research

HOW THIS AFFECTS YOU

● builder: Understanding OPD is directly relevant if you're fine-tuning or distilling from frontier models, as it explains why on-policy data generation outperforms static teacher outputs.

● researcher: OPD's role across multiple frontier models makes it a high-priority technique to understand for anyone working on post-training alignment or distillation pipelines.

SOURCE

https://www.reddit.com/r/MachineLearning/comments/1twmhud/onpolicy_distillation_one_of_the_hottest_terms_on/
