[arXiv] score: 0.10

CaC Video Reward Model Uses Coarse-to-Fine Spatiotemporal Anomaly Detection

May 29, 2026

CaC is a VLM-based reward model for evaluating generated video. It works coarse to fine: it first anchors anomalous temporal windows across the whole clip, then grounds the anomalies spatially within those windows, using structured chain-of-thought reasoning and two-turn GRPO fine-tuning. The paper also introduces a new large-scale dataset annotated with per-frame bounding boxes, temporal anomaly windows, and attribution labels to train this capability.
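
To make the coarse-to-fine idea concrete, here is a minimal Python sketch of how a predicted anomaly might be compared against an annotation: temporal overlap first, spatial overlap only if the window matches. Everything in it is an assumption for illustration: the `AnomalyAnnotation` layout, its field names, the `coarse_to_fine_score` function, and the 0.5 threshold are hypothetical and are not taken from the paper or its dataset schema.

```python
from __future__ import annotations
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class AnomalyAnnotation:
    """Hypothetical record layout; not the dataset's actual schema."""
    window: tuple[float, float]  # (start_sec, end_sec) of the anomalous window
    boxes: dict[int, Box]        # frame index -> bounding box
    attribution: str             # free-text label; format is an assumption


def temporal_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def coarse_to_fine_score(pred: AnomalyAnnotation, gold: AnomalyAnnotation,
                         min_temporal_iou: float = 0.5) -> tuple[float, float]:
    """Return (temporal_score, spatial_score).

    Coarse step: compare time windows. Fine step: average box IoU over
    frames both sides annotate, computed only if the window overlaps enough.
    """
    t = temporal_iou(pred.window, gold.window)
    if t < min_temporal_iou:
        return t, 0.0
    shared = pred.boxes.keys() & gold.boxes.keys()
    if not shared:
        return t, 0.0
    s = sum(box_iou(pred.boxes[f], gold.boxes[f]) for f in shared) / len(shared)
    return t, s


gold = AnomalyAnnotation((2.0, 6.0), {10: (0.0, 0.0, 10.0, 10.0)}, "hypothetical-label")
pred = AnomalyAnnotation((3.0, 7.0), {10: (0.0, 0.0, 10.0, 5.0)}, "hypothetical-label")
t_score, s_score = coarse_to_fine_score(pred, gold)
```

Gating the spatial score on the temporal match is one design choice among several; it simply mirrors the "window first, region second" ordering described above, and the paper's own scoring may differ.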

cs.CV cs.AI

HOW THIS AFFECTS YOU

● builder: If you're building video generation evaluation pipelines, this dataset and model architecture provide a more structured reward signal than flat frame-level scoring.

● researcher: The three-stage training paradigm and GRPO-based optimization offer a replicable recipe for building spatiotemporally-aware video reward models.
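
For readers unfamiliar with GRPO, the step that gives the method its name is the group-relative advantage: several responses are sampled for the same input, and each response's reward is normalized against the others in its group. The Python sketch below shows only that generic step. This summary does not describe CaC's reward function or how its two turns are scored, so the rewards here are placeholder numbers, and the function name and `eps` parameter are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its sampling group: (r - mean) / (std + eps).

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead, which only rescales the result.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder rewards for four sampled responses to one video.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses scoring above the group mean get a positive advantage and those below get a negative one; when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.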

SOURCE

https://arxiv.org/abs/2605.11723
