[arXiv] score: 0.74
MLLMs Produce 5–19x Shorter Outputs on Image Text Input, Suppressing Reasoning Not Perception
May 26, 2026
Across seven MLLMs and seven benchmarks, the modality gap between text-as-image and text-as-tokens stems primarily from reasoning suppression under image input (models skip computation steps), not from perceptual failure, and it is highly sensitive to font and resolution choices.
cs.CL cs.CV
HOW THIS AFFECTS YOU
● builder: If your pipeline passes text as images (OCR, document QA, PDFs), you may be silently losing chain-of-thought quality; rendering choices such as font and resolution matter significantly.
● researcher: A systematic diagnosis shows the modality gap is an artifact of reasoning behavior, not a perceptual limitation, supported by a grounded-theory error analysis of 4,000+ examples.
● designer: Document and UI design choices (font, resolution, layout) measurably affect MLLM reasoning quality, making visual presentation a functional engineering concern.
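For builders who want to check whether their own pipeline shows this effect, one rough approach is to compare output lengths for the same prompts sent as tokens and as rendered images. The sketch below is illustrative only and is not the paper's method: the function names, the whitespace-based length measure, and the default threshold of 5.0 (taken from the low end of the 5–19x range in the title) are all assumptions.

```python
from statistics import median


def length_ratio(text_outputs, image_outputs):
    """Median ratio of output length (text input) to output length (image input).

    Lengths are counted in whitespace-separated words, a crude stand-in
    for a real tokenizer (assumption, chosen to keep the sketch stdlib-only).
    """
    if len(text_outputs) != len(image_outputs):
        raise ValueError("outputs must be paired one-to-one")
    if not text_outputs:
        raise ValueError("need at least one pair of outputs")

    ratios = []
    for text_out, image_out in zip(text_outputs, image_outputs):
        text_len = len(text_out.split())
        # Guard against division by zero when the image-input answer is empty.
        image_len = max(len(image_out.split()), 1)
        ratios.append(text_len / image_len)
    return median(ratios)


def looks_suppressed(ratio, threshold=5.0):
    """Flag a ratio at or above the (hypothetical) threshold."""
    return ratio >= threshold
```

A ratio near 1 suggests the model reasons at similar length in both modes; a large ratio is a prompt to inspect the image-mode responses for skipped steps, and to try other fonts or resolutions before concluding that the model cannot read the text.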