[HUGGINGFACE] score: 0.48

View Dropout Training Forces Unified Multimodal Models to Use Visual Thinking Traces

May 25, 2026

View Dropout (VDrop) is a training method that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. Because the answer can reach the hidden content only through the intermediate visual traces, unified multimodal models are forced to encode spatial information in those traces rather than ignore them. The method targets cross-view spatial reasoning, where language-only reasoning loses fine-grained geometry.
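One way to picture the masking described above is as an attention mask over a single token sequence. The sketch below is illustrative only and is not taken from the paper: the segment labels, the choice of `view_b` as the dropped view, the `drop_prob` parameter, the causal layout, and the function name `vdrop_attention_mask` are all assumptions made for the example.

```python
import random


def vdrop_attention_mask(segments, drop_prob=0.5, seed=0):
    """Build a boolean attention mask with a VDrop-style restriction.

    segments: one label per token, e.g. "view_a", "view_b", "question",
              "think" (thinking-image tokens), "answer" (answer span).
              These labels are hypothetical, chosen for this sketch.
    Returns (mask, dropped), where mask[q][k] is True when query token q
    may attend to key token k, and dropped lists the hidden positions.
    """
    rng = random.Random(seed)
    n = len(segments)

    # Start from an ordinary causal mask (assumed layout, not from the paper).
    mask = [[k <= q for k in range(n)] for q in range(n)]

    # Pick a random subset of tokens from one view to hide (assumed: view_b).
    dropped = [
        i for i, label in enumerate(segments)
        if label == "view_b" and rng.random() < drop_prob
    ]

    # Hide the dropped tokens from answer-span queries only.
    # Thinking-image queries keep their full causal view of the input.
    for q, label in enumerate(segments):
        if label == "answer":
            for k in dropped:
                mask[q][k] = False

    return mask, dropped


# Toy sequence: two views, a question, thinking-image tokens, then the answer.
segments = (
    ["view_a"] * 2 + ["view_b"] * 4 + ["question"]
    + ["think"] * 2 + ["answer"] * 2
)
mask, dropped = vdrop_attention_mask(segments, drop_prob=1.0)
```

In this toy setup the answer tokens can still attend to the thinking-image tokens, so any information about the hidden part of the view has to pass through them. A real training pipeline would apply the mask inside the model's attention layers and would likely use it only during training, as the summary states.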

paper

HOW THIS AFFECTS YOU

● Researcher: VDrop is a concrete training intervention for making visual chain-of-thought traces causally relevant rather than decorative, with measurable effects on cross-view spatial reasoning benchmarks.

SOURCE

https://huggingface.co/papers/2605.27310
