[arXiv] score: 0.13
BilliardPhys-Bench Exposes Stasis Bias in GPT, Claude, Gemini, Qwen Physical Reasoning
June 1, 2026
A procedurally generated billiards benchmark tests ball-collision prediction, wall-bounce reasoning, and final-position estimation across the GPT, Claude, Gemini, and Qwen model families. Every evaluated multimodal LLM (MLLM) loses accuracy as simulation time and scene complexity increase, and all show a consistent stasis bias: they default to predicting no interaction when the correct outcome is harder to infer.
cs.AI, physics.app-ph
HOW THIS AFFECTS YOU
● researcher: Stasis bias as a named, reproducible failure mode in physical reasoning benchmarks gives a concrete target for multimodal world-model improvement.
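
One way to turn stasis bias into a number is sketched below. This is a minimal illustration, not the benchmark's own metric: the `Trial` record, the `"none"` label, and the `stasis_bias_rate` function are all hypothetical names, and the definition (the share of trials with a real interaction where the model predicted no interaction) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label for "nothing happens"; the benchmark's actual label set is not given in the summary.
NO_INTERACTION = "none"


@dataclass
class Trial:
    truth: str      # ground-truth outcome, e.g. "collision", "wall_bounce", or "none"
    predicted: str  # outcome label the model produced


def stasis_bias_rate(trials: List[Trial]) -> float:
    """Fraction of trials with a real interaction that the model labeled as no interaction."""
    # Only trials where something actually happens can reveal a bias toward stasis.
    interacting = [t for t in trials if t.truth != NO_INTERACTION]
    if not interacting:
        return 0.0
    missed = sum(1 for t in interacting if t.predicted == NO_INTERACTION)
    return missed / len(interacting)


if __name__ == "__main__":
    sample = [
        Trial("collision", "none"),
        Trial("wall_bounce", "wall_bounce"),
        Trial("none", "none"),
        Trial("collision", "collision"),
        Trial("wall_bounce", "none"),
    ]
    print(stasis_bias_rate(sample))
```

Computing this rate separately for short and long simulations, or for sparse and crowded scenes, would show whether the bias grows with difficulty, which is the pattern the summary describes.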