[HUGGINGFACE] score: 0.48
View Dropout Training Forces Unified Multimodal Models to Use Visual Thinking Traces
May 25, 2026
View Dropout (VDrop) hides parts of one input view from the answer span during training while keeping them visible to thinking-image tokens, forcing unified multimodal models to encode spatial information in intermediate visual traces rather than ignoring them. The method targets cross-view spatial reasoning, where language-only reasoning loses fine-grained geometry.
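One way to picture the masking described above is as an attention mask over a token sequence. The sketch below is a minimal illustration and not the paper's implementation: it assumes the hiding is done through attention masking in a causal decoder, and the segment labels (`view_a`, `view_b`, `question`, `think`, `answer`), the function names, and the drop rate are all hypothetical.

```python
import random

def sample_dropped(segments, view, rate, rng):
    """Choose which tokens of one input view to hide from the answer span.

    Hypothetical helper: each token of `view` is dropped with probability `rate`.
    """
    return [seg == view and rng.random() < rate for seg in segments]

def vdrop_mask(segments, dropped, answer="answer"):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumption: a standard causal mask, with one extra rule that answer-span
    queries cannot attend to dropped view tokens. Thinking-image queries are
    left unrestricted, so they can still see the dropped tokens.
    """
    n = len(segments)
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            causal = k <= q
            hidden = segments[q] == answer and dropped[k]
            row.append(causal and not hidden)
        mask.append(row)
    return mask

# Toy sequence: two input views, a question, thinking-image tokens, then the answer.
segments = ["view_a", "view_a", "view_b", "view_b", "question",
            "think", "think", "answer", "answer"]

# Fixed drop pattern for illustration: hide the first token of view_b.
dropped = [False, False, True, False, False, False, False, False, False]
mask = vdrop_mask(segments, dropped)

# Random drop pattern, as one might sample per training example.
random_dropped = sample_dropped(segments, "view_b", 0.5, random.Random(0))
```

Under this mask the answer tokens can reach the dropped region of the view only through the thinking-image tokens, which is the pressure the summary describes: the intermediate visual trace has to carry the spatial information for the answer to use it.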
HOW THIS AFFECTS YOU
● Researcher: VDrop is a concrete training intervention for making visual chain-of-thought traces causally relevant rather than decorative, with measurable effects on cross-view spatial reasoning benchmarks.