TRQAM Stabilizes Off-Policy Fine-Tuning of Flow Policies via Trust Region KL Control
May 25, 2026
Trust Region Q-Adjoint Matching (TRQAM) extends Q-Adjoint Matching (QAM) with adaptive path-space KL control via projected dual descent, preventing the model collapse that ill-conditioned critics can cause during off-policy RL fine-tuning of pretrained flow policies. The trust-region parameter λ is optimized within the stochastic optimal control dynamics.
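To make the idea of adaptive KL control concrete, here is a minimal sketch of a generic projected update on a trust-region multiplier. It is not TRQAM's actual update rule: the function name `projected_dual_step`, the KL budget, the step size, the clipping bounds, and the sign convention (a larger λ means stronger KL regularization) are all assumptions made for illustration.

```python
def projected_dual_step(lam, kl_estimate, kl_budget,
                        step_size=0.1, lam_min=0.0, lam_max=1e3):
    """One projected gradient update of a trust-region multiplier.

    Hypothetical helper, not taken from the TRQAM paper. Assumes a larger
    `lam` penalizes KL divergence from the pretrained policy more strongly.
    """
    # Constraint violation: positive when the estimated KL exceeds the budget.
    violation = kl_estimate - kl_budget
    # Gradient step on the multiplier, driven by the violation.
    lam_new = lam + step_size * violation
    # Projection onto the feasible interval [lam_min, lam_max].
    return min(max(lam_new, lam_min), lam_max)


# Made-up KL estimates from successive fine-tuning iterations.
lam = 1.0
for kl in [0.5, 0.3, 0.05, 0.01]:
    lam = projected_dual_step(lam, kl, kl_budget=0.1)
```

In this sketch, λ grows while the estimated KL exceeds the budget and shrinks once the policy is back inside the trust region. The projection keeps λ non-negative and bounded.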
HOW THIS AFFECTS YOU
● Researcher: If you're fine-tuning diffusion or flow-based policies with RL, TRQAM's stability improvement over QAM addresses a known collapse failure mode and is worth testing on your training setup.