[arXiv] score: 0.71

Diffusion Model as Reward Backbone Enables Step-Wise GRPO for Image Alignment

May 26, 2026

DRM uses a pre-trained diffusion model as a reward backbone to score noisy intermediate latents at each denoising step, and Step-wise GRPO applies these dense per-step rewards to improve alignment with perceptual qualities like aesthetics and composition.
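As a rough illustration of what "dense per-step rewards" could mean inside a GRPO-style update, the sketch below normalizes rewards across a group of samples separately at each denoising step. This is an assumption about the general shape of the idea, not the paper's exact formulation. The function name `stepwise_group_advantages`, the `rewards[i][t]` layout, and the `eps` stabilizer are hypothetical, and the rewards are plain numbers standing in for whatever the reward backbone would output.

```python
from statistics import mean, pstdev


def stepwise_group_advantages(rewards, eps=1e-8):
    """Group-relative advantages computed independently at each step.

    rewards[i][t] is the (hypothetical) reward of sample i at denoising
    step t, for a group of samples generated from the same prompt.
    Returns a list of lists with the same shape as `rewards`.
    """
    num_steps = len(rewards[0])
    advantages = [[0.0] * num_steps for _ in rewards]
    for t in range(num_steps):
        # Collect the rewards of every sample in the group at step t.
        column = [row[t] for row in rewards]
        mu = mean(column)
        sigma = pstdev(column)  # population std over the group
        for i, r in enumerate(column):
            # Standard GRPO-style normalization, applied per step.
            advantages[i][t] = (r - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    # Two samples, three denoising steps, made-up reward values.
    group_rewards = [[0.2, 0.5, 0.9], [0.4, 0.5, 0.7]]
    for row in stepwise_group_advantages(group_rewards):
        print([round(a, 3) for a in row])
```

In an outcome-only setup every step of a trajectory would share a single advantage derived from the final image. Here each step receives its own value, which is the property the summary attributes to step-wise rewards.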

cs.CV

HOW THIS AFFECTS YOU

● builder: Step-wise GRPO could improve image generation quality in fine-tuning pipelines without requiring a separate VLM reward model.

● researcher: Using diffusion model internals for step-wise reward signals is a novel alternative to VLM-based reward models, with direct access to perceptual generation priors.

SOURCE

https://arxiv.org/abs/2605.25661
