[HUGGINGFACE] score: 0.48
GDSD Eliminates ELBO Bias in RL Training for Diffusion Language Models
May 27, 2026
Standard RL for diffusion LLMs approximates the policy likelihood with an ELBO estimated from randomly masked sequences, which introduces a training-inference mismatch bias. GDSD instead distills the denoiser directly from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL-regularized RL, removing the ELBO surrogate.
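The "closed-form optimum of reverse-KL regularized RL" mentioned above can be illustrated on a toy categorical distribution. The sketch below is not GDSD itself: the summary does not say how the teacher is applied to the denoiser, so this only shows the standard result that maximizing E_p[A] - beta * KL(p || p_ref) gives p*(i) proportional to p_ref(i) * exp(A(i) / beta). The function names, the temperature `beta`, and the example numbers are all hypothetical.

```python
import math

def tilted_teacher(ref_probs, advantages, beta):
    """Toy closed-form optimum: p*(i) proportional to ref_probs[i] * exp(advantages[i] / beta).

    Assumes every entry of ref_probs is strictly positive and beta > 0.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + a / beta for p, a in zip(ref_probs, advantages)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def reverse_kl(p, q):
    """KL(p || q) for two categorical distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_objective(p, ref_probs, advantages, beta):
    """Expected advantage under p minus beta times KL(p || ref)."""
    expected_adv = sum(pi * a for pi, a in zip(p, advantages))
    return expected_adv - beta * reverse_kl(p, ref_probs)

# Hypothetical example: three candidate outcomes with made-up advantages.
ref = [0.5, 0.3, 0.2]
adv = [1.0, 0.0, -1.0]
teacher = tilted_teacher(ref, adv, beta=1.0)
```

In this toy setting the teacher shifts probability mass toward outcomes with higher advantage, and a larger `beta` keeps it closer to the reference distribution. With all advantages equal to zero it returns the reference unchanged.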
paper
HOW THIS AFFECTS YOU
● researcher: GDSD provides a theoretically cleaner RL training signal for diffusion LLMs that avoids the bias introduced by ELBO-based likelihood surrogates.