GDSD Eliminates ELBO Bias in RL Training for Diffusion Language Models

May 27, 2026

Standard RL for diffusion LLMs approximates policy likelihood with an ELBO estimated from randomly masked sequences, introducing training-inference mismatch bias. GDSD instead distills the denoiser directly from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL, removing the ELBO surrogate.
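The "closed-form optimum of reverse-KL regularized RL" mentioned above can be illustrated on a toy example. For a finite set of outcomes, maximizing E_pi[A] - beta * KL(pi || pi_ref) has the well-known solution pi*(y) proportional to pi_ref(y) * exp(A(y) / beta). The sketch below only demonstrates that standard result on a three-outcome distribution; how GDSD actually builds its advantage-guided self-teacher at the denoiser level is not described in this summary, and the names `tilted_teacher`, `objective`, `ref`, `adv`, and `beta` are hypothetical.

```python
import math

def tilted_teacher(ref_probs, advantages, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(A(y) / beta).

    This is the maximizer of E_pi[A] - beta * KL(pi || pi_ref) over a
    finite set of outcomes. Assumes every reference probability is > 0.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + a / beta for p, a in zip(ref_probs, advantages)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def objective(pi, ref_probs, advantages, beta):
    """Reverse-KL regularized objective: E_pi[A] - beta * KL(pi || pi_ref)."""
    expected_adv = sum(p * a for p, a in zip(pi, advantages))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, ref_probs) if p > 0)
    return expected_adv - beta * kl

# Toy inputs (illustrative values, not taken from the paper).
ref = [0.5, 0.3, 0.2]    # reference (current) policy over three outcomes
adv = [-1.0, 0.0, 2.0]   # advantage assigned to each outcome
beta = 1.0               # strength of the KL regularizer

teacher = tilted_teacher(ref, adv, beta)
print(teacher)
print(objective(teacher, ref, adv, beta), objective(ref, ref, adv, beta))
```

The teacher shifts probability mass toward high-advantage outcomes while staying anchored to the reference policy; a smaller `beta` makes the shift sharper. Distilling a model toward such a target is one way to use the RL signal without ever evaluating a sequence likelihood, which is the role the ELBO surrogate plays in the standard approach.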

paper

HOW THIS AFFECTS YOU

● Researcher: GDSD provides a theoretically cleaner RL training signal for diffusion LLMs that avoids the bias introduced by ELBO-based likelihood surrogates.

SOURCE

https://huggingface.co/papers/2605.29398