[arXiv] score: 0.22
VideoMLA Cuts Video Diffusion KV Cache Memory 92.7% with Low-Rank Latents
May 29, 2026
VideoMLA applies Multi-Head Latent Attention (MLA) to causal video diffusion, replacing per-head KV pairs with a shared low-rank content latent and a decoupled 3D-RoPE positional key. This cuts per-token KV memory by 92.7% while retaining generation quality. Notably, the compression works even though video attention is not low-rank in the spectral sense that motivates MLA in language models.
cs.CV, cs.AI
HOW THIS AFFECTS YOU
● builder: A 92.7% KV memory reduction at every cached layer directly reduces streaming memory and latency for long-rollout video generation, making minute-scale video diffusion more feasible on constrained hardware.
● researcher: The finding that MLA succeeds in video diffusion despite high effective rank in pretrained attention weights challenges the standard spectral justification and warrants further investigation into why low-rank compression generalizes here.
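
To make the per-token memory claim concrete, here is a minimal sketch of the cache-size arithmetic for standard multi-head attention versus an MLA-style cache. The function names and all dimensions (12 heads, head size 128, latent size 192, positional key size 32) are hypothetical, chosen only for illustration and not taken from the paper. The sketch also assumes the MLA cache stores one shared content latent plus one shared positional key per token per layer.

```python
def mha_cache_per_token(n_heads: int, head_dim: int) -> int:
    """Values cached per token per layer with standard attention:
    one key and one value vector for every head."""
    return 2 * n_heads * head_dim


def mla_cache_per_token(latent_dim: int, rope_dim: int) -> int:
    """Values cached per token per layer with an MLA-style cache:
    one shared content latent plus one shared positional (RoPE) key."""
    return latent_dim + rope_dim


def kv_reduction(n_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token KV memory saved by the MLA-style cache."""
    baseline = mha_cache_per_token(n_heads, head_dim)
    compressed = mla_cache_per_token(latent_dim, rope_dim)
    return 1.0 - compressed / baseline


# Hypothetical dimensions, NOT the paper's configuration.
saved = kv_reduction(n_heads=12, head_dim=128, latent_dim=192, rope_dim=32)
print(f"per-token KV reduction: {saved:.1%}")  # prints "per-token KV reduction: 92.7%"
```

With these illustrative numbers the baseline caches 3072 values per token and the compressed cache holds 224, so the ratio lands near the reported figure. Because the saving is per token and per layer, the absolute memory saved grows linearly with rollout length.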