[arXiv] score: 0.37
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
May 14, 2026
This paper identifies a prefill-stage bottleneck in GUI grounding with vision-language models (VLMs). It shows that grounding follows a two-stage process in which interaction among visual tokens during prefill is critical. Existing training-free methods spend compute on multiple inference passes without addressing this root cause, which suggests that prefill-aware architectures could substantially improve GUI agents.
cs.CV