Physics-IQ Benchmark Audit Exposes Measurement Flaws in Video Model Evaluation
June 16, 2026
A systematic audit of the Physics-IQ benchmark for video generative models identifies confounding factors in prompt and ground-truth quality that distort physical understanding scores. Three proposed fixes clarify the prompts, improve ground-truth fidelity, and introduce sample-level evaluation granularity.
HOW THIS AFFECTS YOU
● Researcher: If you're using Physics-IQ to evaluate world models or video generation, the identified confounds mean prior results may not accurately reflect physical reasoning capability.