[arXiv] score: 0.15
VLM Judges Unreliable for Visually Impaired Assistance Tasks — GPT-5.4 Hits Only 52.6% Diagnostic Accuracy
June 1, 2026
VIABLE, a 300K-sample benchmark spanning three visually impaired assistance (VIA) scenarios, finds that all seven VLM judges tested are unreliable evaluators for VIA tasks. GPT-5.4 scores highest, at 52.6% single-failure diagnostic accuracy, but also shows the highest self-preference rate, at 94.2%, which undermines its use as an impartial judge.
cs.CL cs.CV
HOW THIS AFFECTS YOU
● builder: If you are using VLM-as-a-Judge pipelines for accessibility applications, these results suggest you cannot rely on current models, including GPT-5.4, for automated evaluation.
● researcher: The 12-mode failure taxonomy and Effectiveness-Impartiality-Stability framework give you concrete axes to improve VLM judge design for accessibility domains.
● policy: A self-preference rate of 94.2% in the strongest available judge raises fairness and reliability concerns for AI systems deployed in assistive technology contexts.