[arXiv]

18 Frontier MLLMs Flip Answers 24–50% of the Time Under Input Reordering

June 25, 2026

Facet-Probe audits 18 MLLMs across five ordering perturbations (option, evidence-chunk, document-rank, image-set, and mixed-modality) and finds panel-mean flip rates of 24–50%, averaged across all models. Even the best model flips on 13.4% of trials, and a Gemini temperature-0 control confirms that the excess is driven by ordering, not decoder stochasticity. Standard benchmarks that use a single canonical ordering systematically miss this reliability failure.

HOW THIS AFFECTS YOU

● builder: Any production MLLM pipeline that processes documents, images, or multi-chunk evidence should randomize or canonicalize input ordering and test for flip sensitivity before deployment.

● researcher: Benchmark scores from single-ordering evaluations overstate reliability. Facet-Probe's Bayesian IRT separation of noise from bias is a methodological improvement worth adopting in evaluation pipelines.
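As a rough illustration of the flip-sensitivity test suggested for builders, the sketch below counts how often a model's answer changes when the same options are presented in a different order. It is not Facet-Probe's method or metric (the summary does not specify them): `flip_rate`, `first_option_model`, and `canonicalized` are hypothetical names, and the "model" is a toy callable that returns the text of the chosen option so that answers can be compared across orderings.

```python
from itertools import islice, permutations
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: a model takes a question and an ordered list of
# options and returns the TEXT of the option it picks (not its position).
AnswerFn = Callable[[str, List[str]], str]


def flip_rate(answer_fn: AnswerFn, question: str, options: Sequence[str],
              max_orderings: Optional[int] = None) -> float:
    """Fraction of reorderings whose answer differs from the original order."""
    original = list(options)
    baseline = answer_fn(question, original)
    flips = 0
    total = 0
    # Enumerate reorderings; cap with max_orderings for large option sets.
    for perm in islice(permutations(original), max_orderings):
        reordered = list(perm)
        if reordered == original:
            continue  # skip the original ordering itself
        total += 1
        if answer_fn(question, reordered) != baseline:
            flips += 1
    return flips / total if total else 0.0


def first_option_model(question: str, options: List[str]) -> str:
    """Toy stand-in for a position-biased model: always picks the first option."""
    return options[0]


def canonicalized(answer_fn: AnswerFn) -> AnswerFn:
    """Wrap a model so it always sees options in one fixed (sorted) order."""
    return lambda question, options: answer_fn(question, sorted(options))


if __name__ == "__main__":
    opts = ["Paris", "Lyon", "Nice", "Lille"]
    print(flip_rate(first_option_model, "Capital of France?", opts))
    print(flip_rate(canonicalized(first_option_model), "Capital of France?", opts))
```

With four options, the toy position-biased model flips on 18 of the 23 non-original orderings, while the canonicalized wrapper never flips. Canonicalization only hides the sensitivity, though: the wrapped model is still wrong whenever the fixed order favors the wrong option, so measuring the raw flip rate before deployment remains useful.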
