● Builder: If your product relies on VLMs answering image-grounded questions, this benchmark can quantify how much your model is guessing from question phrasing rather than image content.
● Researcher: The phrasing-controlled design isolates textual prior from visual reasoning more cleanly than existing benchmarks, making it a useful evaluation tool for VLM robustness work.