Clinical Reasoning Benchmarks Show Low Pass Rates for Critical Criteria
July 3, 2026
A clinician-authored rubric evaluation of GPT 5.4, Claude Opus 4.7, and Gemini 3.1 Pro finds that frontier models fail most of the highest-stakes clinical criteria. Low-stakes criteria passed at 80-90%, but the highest-weighted clinical criteria passed at only 32.4-41.7%.
HOW THIS AFFECTS YOU
● Researcher: This highlights the need for rubric-based, weighted evaluations rather than simple multiple-choice benchmarks.
● Health: Expect low reliability in high-stakes diagnostic tasks despite high scores on general medical benchmarks.
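
To illustrate why weighting matters, here is a minimal sketch of how a weighted rubric score can diverge from a plain pass rate. It is not the evaluation's actual scoring method: the `Criterion` class, the weights, and the example criteria are all hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    weight: float  # hypothetical clinician-assigned importance
    passed: bool   # whether the model's answer met this criterion


def unweighted_pass_rate(criteria):
    """Fraction of criteria passed, treating every criterion equally."""
    return sum(c.passed for c in criteria) / len(criteria)


def weighted_score(criteria):
    """Fraction of total weight earned by the criteria that passed."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total


# Made-up rubric for a single response: three low-stakes criteria pass,
# one heavily weighted critical criterion fails.
criteria = [
    Criterion("uses plain language", 1.0, True),
    Criterion("states follow-up timing", 1.0, True),
    Criterion("lists common side effects", 1.0, True),
    Criterion("flags symptom needing urgent referral", 5.0, False),
]

print(unweighted_pass_rate(criteria))  # 0.75
print(weighted_score(criteria))        # 0.375
```

In this toy case the response passes three of four criteria, yet earns well under half of the available weight because the one failure is the critical item. Reporting pass rates per weight tier, as the evaluation does, makes that gap visible.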