MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training
June 6, 2026
MaskAlign applies representation alignment only to high-gradient tokens rather than to all spatial tokens, addressing the timestep-noise mismatch between noisy diffusion inputs and clean reference features from self-supervised encoders. The method identifies tokens with large alignment-gradient norms, which show stable spatial preferences, and restricts the alignment objective to that subset. Doing so reduces training overhead while maintaining or improving convergence speed and generation quality in diffusion transformer training.
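
The selection step described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation, and several details are assumptions that the summary does not specify: the alignment loss is taken to be a per-token cosine distance (1 - cos) between a projected diffusion feature and the encoder's reference feature, the gradient norm is measured with respect to the projected feature, and the subset is chosen by a top-k rule controlled by a hypothetical `keep_ratio` parameter. All function names are illustrative.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    na = max(math.sqrt(_dot(a, a)), eps)
    nb = max(math.sqrt(_dot(b, b)), eps)
    return _dot(a, b) / (na * nb)


def grad_norm(h, z, eps=1e-8):
    """Norm of d(1 - cos(h, z)) / dh.

    For the cosine loss assumed here this has the closed form
    sqrt(1 - cos^2) / ||h||, so no autograd is needed in the sketch.
    """
    c = cosine(h, z, eps)
    return math.sqrt(max(0.0, 1.0 - c * c)) / max(math.sqrt(_dot(h, h)), eps)


def select_tokens(pred, ref, keep_ratio=0.25):
    """Indices of the tokens with the largest alignment-gradient norms.

    keep_ratio is a hypothetical hyperparameter (fraction of tokens kept).
    """
    norms = [grad_norm(h, z) for h, z in zip(pred, ref)]
    k = max(1, math.ceil(keep_ratio * len(pred)))
    order = sorted(range(len(pred)), key=lambda i: norms[i], reverse=True)
    return sorted(order[:k])


def masked_alignment_loss(pred, ref, keep_ratio=0.25):
    """Mean cosine-distance loss over the selected token subset only."""
    idx = select_tokens(pred, ref, keep_ratio)
    loss = sum(1.0 - cosine(pred[i], ref[i]) for i in idx) / len(idx)
    return loss, idx


if __name__ == "__main__":
    # Four toy tokens with 2-d features; the reference is the same for all.
    pred = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    ref = [[1.0, 0.0]] * 4
    loss, idx = masked_alignment_loss(pred, ref, keep_ratio=0.5)
    print(idx, loss)
```

In a real training loop the per-token gradient norms would more likely come from the framework's autograd, and the selected index set might be cached or smoothed across steps if token preferences are spatially stable, as the summary suggests. Both of those choices are outside what this sketch shows.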