iVGR Internalizes Visual Grounding Into Textual CoT via RL, Drops Explicit Boxes

May 28, 2026

Forcing explicit bounding box outputs in visually grounded CoT during inference degrades MLLM performance versus standard textual CoT. iVGR uses reinforcement learning to internalize spatial localization into the textual reasoning chain, removing mandatory explicit grounding at inference time while preserving fine-grained perception gains.

HOW THIS AFFECTS YOU

● Builder: Training with iVGR may improve MLLM accuracy on fine-grained visual tasks without requiring explicit box outputs at inference.

● Researcher: The finding that mandatory explicit grounding hurts inference performance challenges a common assumption in visually grounded CoT design.

SOURCE

https://huggingface.co/papers/2605.31096
