VLM Judges Unreliable for Visually Impaired Assistance Tasks — GPT-5.4 Hits Only 52.6% Diagnostic Accuracy

June 1, 2026

VIABLE, a 300K-sample benchmark spanning three visually impaired assistance (VIA) scenarios, finds that all seven tested VLM judges are unreliable evaluators for VIA tasks. GPT-5.4 scores highest, at 52.6% single-failure diagnostic accuracy, but it also shows the highest self-preference rate, 94.2%, which undermines its use as an impartial judge.

cs.CL, cs.CV

HOW THIS AFFECTS YOU

● builder: If you use VLM-as-a-Judge pipelines for accessibility applications, these results suggest that current models, including GPT-5.4, cannot be trusted for reliable automated evaluation.

● researcher: The 12-mode failure taxonomy and the Effectiveness-Impartiality-Stability framework give you concrete axes for improving VLM judge design in accessibility domains.

● policy: A self-preference rate of 94.2% in the highest-scoring judge tested raises fairness and reliability concerns for AI systems deployed in assistive technology contexts.

SOURCE

https://arxiv.org/abs/2605.31351
