[arXiv] score: 0.71
Diffusion Model as Reward Backbone Enables Step-Wise GRPO for Image Alignment
May 26, 2026
DRM uses a pre-trained diffusion model as a reward backbone to score noisy intermediate latents at each denoising step, and Step-wise GRPO applies these dense per-step rewards to improve alignment with perceptual qualities like aesthetics and composition.
cs.CV
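To make the "dense per-step rewards" idea concrete, here is a minimal sketch of how group-relative advantages could be computed separately at each denoising step. It is an illustration under stated assumptions, not the paper's implementation: the function name `stepwise_group_advantages`, the per-step normalization across the sampled group, and the toy reward values are all hypothetical. The summary above does not specify how Step-wise GRPO normalizes or weights its rewards.

```python
import numpy as np


def stepwise_group_advantages(step_rewards, eps=1e-8):
    """Normalize rewards across a group of samples, independently per step.

    step_rewards: array of shape (G, T), where G is the number of images
    sampled for one prompt and T is the number of denoising steps.
    Assumption: each step's reward is standardized against the other
    samples in the group at that same step (GRPO-style group baseline).
    """
    r = np.asarray(step_rewards, dtype=np.float64)
    if r.ndim != 2:
        raise ValueError("step_rewards must have shape (G, T)")
    mean = r.mean(axis=0, keepdims=True)  # per-step group mean
    std = r.std(axis=0, keepdims=True)    # per-step group std
    return (r - mean) / (std + eps)


# Toy rewards (made up): 3 samples for one prompt, 3 denoising steps.
rewards = np.array([
    [0.1, 0.4, 0.9],
    [0.3, 0.4, 0.5],
    [0.2, 0.4, 0.1],
])
adv = stepwise_group_advantages(rewards)
```

In this sketch a step where every sample receives the same reward (the middle column) contributes roughly zero advantage, so only steps where the samples differ would drive the policy update.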
HOW THIS AFFECTS YOU
● builder: Step-wise GRPO could improve image generation quality in fine-tuning pipelines without requiring a separate VLM reward model.
● researcher: Using diffusion model internals for step-wise reward signals is a novel alternative to VLM-based reward models, with direct access to perceptual generation priors.