[arXiv] score: 0.41
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
May 15, 2026
Physics-R1 audits multimodal physics benchmarks with a three-stage pipeline that combines n-gram Jaccard overlap, embedding cosine similarity, and LLM judging. The audit uncovers 134 near-duplicates and 4,846 paraphrase-contamination candidates that standard single-stage checks miss. The paper documents train-eval contamination, translation drift, and multiple-choice question (MCQ) saturation as systemic flaws. Relevant to anyone training or evaluating vision-language models on physics reasoning datasets.
cs.CL
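
The three-stage check named in the summary (n-gram Jaccard, then embedding cosine similarity, then an LLM judge) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the thresholds, the function names (`classify_pair`, `toy_embed`), and the label strings are all hypothetical, the bag-of-words "embedding" is a stand-in for a real sentence-embedding model, and the judge is a caller-supplied callable rather than an actual LLM call.

```python
import math
from collections import Counter
from typing import Callable, Optional


def ngrams(text: str, n: int = 3) -> set:
    """Set of word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Stage 1: n-gram Jaccard overlap, |A ∩ B| / |A ∪ B|."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (assumption): sparse bag-of-words counts.
    A real audit would use a learned sentence-embedding model here."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Stage 2: cosine similarity between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_pair(
    train_item: str,
    eval_item: str,
    judge: Optional[Callable[[str, str], bool]] = None,
    jaccard_thresh: float = 0.8,   # hypothetical threshold
    cosine_thresh: float = 0.7,    # hypothetical threshold
) -> str:
    """Label a train/eval pair using the three stages in order."""
    # Stage 1: high surface overlap is flagged as a near-duplicate.
    if jaccard(train_item, eval_item) >= jaccard_thresh:
        return "near_duplicate"
    # Stage 2: semantically close pairs move on to the judge.
    if cosine(toy_embed(train_item), toy_embed(eval_item)) >= cosine_thresh:
        # Stage 3: the judge (an LLM in the paper's description) confirms
        # or rejects the candidate; without a judge, the candidate is kept.
        if judge is None or judge(train_item, eval_item):
            return "paraphrase_candidate"
    return "clean"
```

Ordering the stages from cheapest to most expensive means the judge only sees pairs that the first two stages could not settle. A reordered problem statement shows why a single-stage check can miss contamination: its n-gram overlap falls below the first threshold, but its embedding similarity stays high.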