AC-ODM Uses RL Policy to Dynamically Mix Pretraining Data More Efficiently
June 13, 2026
AC-ODM frames LLM pretraining data mixing as a reinforcement learning problem, using a parameterized policy proven to act as a dynamic linear surrogate that maximizes constructive gradient interference. It supports a proxy mode where a policy trained on a small model transfers to a larger target, and a non-proxy mode for streaming corpora, improving sample efficiency over static mixing strategies.
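To make the "data mixing as reinforcement learning" framing concrete, here is a minimal toy sketch. It is not AC-ODM itself: it uses plain REINFORCE with a running-average baseline over a softmax policy, and the reward is a made-up stand-in for whatever training signal the real method uses. The names `train_mixing_policy` and `toy_reward`, the three domains, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Convert policy logits into sampling probabilities over data domains."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_mixing_policy(reward_fn, num_domains, steps=2000, lr=0.1, seed=0):
    """Learn domain sampling weights with REINFORCE (illustrative only)."""
    rng = random.Random(seed)
    logits = [0.0] * num_domains  # start from a uniform mixture
    baseline = 0.0                # running average reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        # Action: pick which domain the next training batch is drawn from.
        domain = rng.choices(range(num_domains), weights=probs)[0]
        reward = reward_fn(domain, rng)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log softmax w.r.t. logit k is (one_hot[k] - probs[k]).
        for k in range(num_domains):
            indicator = 1.0 if k == domain else 0.0
            logits[k] += lr * advantage * (indicator - probs[k])
    return softmax(logits)


def toy_reward(domain, rng):
    """Hypothetical reward: fixed per-domain usefulness plus Gaussian noise.

    In a real system this would come from the model being trained; the
    numbers here are invented for the demo.
    """
    means = [0.2, 1.0, 0.5]
    return means[domain] + rng.gauss(0.0, 0.1)


weights = train_mixing_policy(toy_reward, num_domains=3)
print([round(w, 3) for w in weights])
```

In this toy setup the policy shifts sampling probability toward the domain with the highest average reward. In the proxy mode described above, the analogous step would be learning such a policy alongside a small model and reusing it for the larger target.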
HOW THIS AFFECTS YOU
● Researcher: The proxy-mode policy transfer from small to large models is a practically useful technique for reducing the compute cost of data mixing optimization in large pretraining runs.