[arXiv] score: 0.10
CaC Video Reward Model Uses Coarse-to-Fine Spatiotemporal Anomaly Detection
May 29, 2026
CaC is a reward model for evaluating generated video, built on a vision-language model (VLM). It works coarse to fine: it first anchors anomalous temporal windows globally, then grounds the anomalies spatially, using structured chain-of-thought reasoning and two-turn GRPO fine-tuning. The paper also introduces a large-scale dataset with per-frame bounding boxes, temporal anomaly windows, and attribution labels for training this capability.
cs.CV cs.AI
HOW THIS AFFECTS YOU
● builder: If you're building video generation evaluation pipelines, this dataset and model architecture provide a more structured reward signal than flat frame-level scoring.
● researcher: The three-stage training paradigm and GRPO-based optimization offer a replicable recipe for building spatiotemporally-aware video reward models.
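
To make the coarse-to-fine idea in the summary concrete, here is a minimal Python sketch. It is not CaC's implementation: the `AnomalyWindow` record, the per-frame scores, the threshold, and the toy `reward` function are all hypothetical stand-ins, chosen only to show temporal windows being found first and bounding boxes being attached afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class AnomalyWindow:
    """Hypothetical annotation record: a temporal window plus per-frame boxes."""
    start: int                  # first anomalous frame index (inclusive)
    end: int                    # last anomalous frame index (inclusive)
    label: str = "unknown"      # attribution label (placeholder value)
    boxes: dict = field(default_factory=dict)  # frame index -> (x1, y1, x2, y2)


def find_windows(frame_scores, threshold=0.5):
    """Coarse step: group consecutive frames whose score exceeds the threshold."""
    windows, start = [], None
    for i, score in enumerate(frame_scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            windows.append(AnomalyWindow(start, i - 1))
            start = None
    if start is not None:  # a window still open at the last frame
        windows.append(AnomalyWindow(start, len(frame_scores) - 1))
    return windows


def ground_windows(windows, frame_boxes):
    """Fine step: keep boxes only for frames inside a flagged window."""
    for w in windows:
        w.boxes = {
            f: frame_boxes[f]
            for f in range(w.start, w.end + 1)
            if f in frame_boxes
        }
    return windows


def reward(windows, num_frames):
    """Toy reward (an assumption): fraction of frames outside every window."""
    anomalous = sum(w.end - w.start + 1 for w in windows)
    return 1.0 - anomalous / num_frames
```

With scores `[0.1, 0.2, 0.9, 0.8, 0.1, 0.7]` and the default threshold, `find_windows` flags frames 2 to 3 and frame 5, so three of six frames are anomalous and the toy reward is 0.5. A flat frame-level scorer would report only that number, while the structured record also says when and where the problem occurs.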