[arXiv] score: 0.17
WebRISE Benchmark: Best MLLMs Hit Only 65.6% on Web Interaction Tasks
June 3, 2026
WebRISE evaluates MLLM-generated web artifacts using Interaction Contract Graphs across 442 tasks, 5,495 state transitions, and 5,271 requirement checks. Top models reach only 65.6% transition validity, and visual quality is a poor proxy for functional correctness: on Markdown tasks, Qwen3.6-35B-A3B scores V=80.8 but only T=15.5.
cs.CL, cs.AI
HOW THIS AFFECTS YOU
● Builder: If you're shipping MLLM-based web generation, these numbers suggest that visual quality metrics alone are unreliable indicators of functional correctness; functional state coverage needs separate evaluation.
● Researcher: WebRISE exposes a concrete gap between visual fidelity and behavioral correctness in web generation, with 14-model comparisons across five input modalities.