[arXiv] score: 0.18

540-image phrasing-controlled benchmark exposes VLM textual-prior reliance across 11 models

June 10, 2026

A new 540-image benchmark pairs each image with four question phrasings across six reasoning categories. All 11 tested VLMs degrade on the hardest image-grounded variant, with open-weight models dropping furthest. A no-image ablation serves as the central diagnostic for textual-prior reliance: a model that still answers correctly with the image withheld is drawing on the question text rather than the image.

HOW THIS AFFECTS YOU

● builder: If your product relies on VLMs answering image-grounded questions, this benchmark can quantify how much your model is guessing from question phrasing rather than image content.

● researcher: The phrasing-controlled design isolates textual prior from visual reasoning more cleanly than existing benchmarks, making it a useful evaluation tool for VLM robustness work.
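
The no-image diagnostic described above can be approximated on your own evaluation logs. The following is a minimal sketch, not the paper's scoring code: the `Record` fields, the phrasing labels, and the "gap" metric are all assumptions made for illustration. It compares accuracy with and without the image for each question phrasing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    # All field names are hypothetical; adapt them to your own eval logs.
    phrasing: str                # label for one of the question phrasings
    correct_with_image: bool     # model was correct when shown the image
    correct_without_image: bool  # model was correct with the image withheld


def prior_reliance_by_phrasing(records):
    """Return {phrasing: (acc_with_image, acc_without_image, gap)}.

    gap = acc_with_image - acc_without_image. This is an illustrative
    metric, not necessarily the one the benchmark reports.
    """
    # Per phrasing: [total, hits with image, hits without image]
    counts = defaultdict(lambda: [0, 0, 0])
    for r in records:
        c = counts[r.phrasing]
        c[0] += 1
        c[1] += int(r.correct_with_image)
        c[2] += int(r.correct_without_image)

    result = {}
    for phrasing, (n, with_img, without_img) in counts.items():
        acc_with = with_img / n
        acc_without = without_img / n
        result[phrasing] = (acc_with, acc_without, acc_with - acc_without)
    return result
```

Under these assumptions, high accuracy with the image withheld suggests that a phrasing can be answered from text alone, and a small gap suggests the image is contributing little to the answer.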
