[arXiv] score: 0.37

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

May 14, 2026

This paper identifies a prefill-stage bottleneck in GUI grounding with vision-language models (VLMs). It shows that grounding follows a two-stage process in which visual-token interaction during prefill is critical. Existing training-free methods spend compute on multiple inference passes without addressing this root cause, which suggests that prefill-aware architectures could significantly improve GUI agents.

cs.CV

SOURCE

https://arxiv.org/abs/2605.12549
