[arXiv] score: 0.74

MLLMs Produce 5–19x Shorter Outputs on Image Text Input, Suppressing Reasoning Not Perception

May 26, 2026

Across seven MLLMs and seven benchmarks, the modality gap between text rendered as an image and the same text supplied as tokens is caused primarily by reasoning suppression under image input (models skip computation steps), not by perceptual failure. The gap is also highly sensitive to font and resolution choices.

cs.CL, cs.CV

HOW THIS AFFECTS YOU

● builder: If your pipeline passes text as images (OCR, document QA, PDFs), you may be silently losing chain-of-thought quality; rendering choices like font and resolution matter significantly.

● researcher: Systematic diagnosis of the modality gap reveals it is a reasoning-behavior artifact, not a perceptual limitation, with grounded-theory error analysis across 4,000+ examples.

● designer: Document and UI design choices (font, resolution, layout) measurably affect MLLM reasoning quality, making visual presentation a functional engineering concern.
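For builders who want a rough check on their own pipeline, one option is to compare response lengths for the same prompts sent as tokens and as rendered images. The sketch below is illustrative only and is not the paper's method: the function names, the whitespace-based word count, and the 5.0 threshold (borrowed from the low end of the 5–19x range in the title) are all assumptions.

```python
from statistics import median


def length_ratio(text_outputs, image_outputs):
    """Median response length for text input divided by that for image input.

    Length is a whitespace word count (an assumption; a real tokenizer
    would give different numbers).
    """
    text_len = median(len(out.split()) for out in text_outputs)
    image_len = median(len(out.split()) for out in image_outputs)
    if image_len == 0:
        # Image-input responses are empty: treat the ratio as unbounded.
        return float("inf")
    return text_len / image_len


def flag_suppression(text_outputs, image_outputs, threshold=5.0):
    """Return True when image-input responses are at least `threshold` times shorter.

    The default threshold is a hypothetical choice, not a value from the paper.
    """
    return length_ratio(text_outputs, image_outputs) >= threshold
```

A high ratio on its own only signals shorter outputs; whether the missing text was reasoning still has to be checked by reading the responses.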

SOURCE

https://arxiv.org/abs/2603.09095
