[HUGGINGFACE] score: 0.80
Staged VLM Training Shows Perception Bottlenecks Reasoning; RL Beats SFT for Visual Tasks
May 18, 2026
Decomposing VLM post-training into three sequential stages (visual perception, visual reasoning, and textual reasoning) shows that perception is the primary bottleneck and that RL outperforms caption-based SFT for learning visual perception.
paper