[arXiv]

Survey Maps Five-Stage Evolution of Vision-Language Perception in MLLMs

June 26, 2026

A systematic survey formalizes unified vision-language perception in multimodal large language models (MLLMs), treating it as a single integrated capability rather than as separate per-modality skills. It covers the paradigm shift driven by models such as the OpenAI o-series and the DeepSeek R-series, and introduces a five-stage evolutionary framework for the development of cross-modal perception.

HOW THIS AFFECTS YOU

● Researcher: Useful as a structured reference for positioning new MLLM perception work within the broader architectural evolution from early fusion to reasoning-centric models.
