
BilliardPhys-Bench Exposes Stasis Bias in GPT, Claude, Gemini, Qwen Physical Reasoning

June 1, 2026

A procedurally generated billiards benchmark tests ball-collision prediction, wall-bounce reasoning, and final-position estimation across the GPT, Claude, Gemini, and Qwen model families. All evaluated multimodal large language models (MLLMs) show performance drops as simulation time and scene complexity increase, and all exhibit a consistent stasis bias: they default to predicting no interaction when the correct outcome is harder to infer.

cs.AI, physics.app-ph

HOW THIS AFFECTS YOU

● Researcher: Stasis bias, as a named and reproducible failure mode in physical-reasoning benchmarks, gives a concrete target for improving multimodal world models.

SOURCE

https://arxiv.org/abs/2605.30900
