[arXiv] score: 0.32

Clinical Reasoning Benchmarks Show Low Pass Rates for Critical Criteria

July 3, 2026

A clinician-authored rubric evaluation of GPT 5.4, Claude Opus 4.7, and Gemini 3.1 Pro finds that frontier models fail most of the highest-weighted clinical criteria. Low-stakes criteria passed at 80-90%, while the highest-weighted clinical criteria passed at only 32.4-41.7%.

HOW THIS AFFECTS YOU

● researcher: This highlights the need for rubric-based, weighted evaluations rather than simple multiple-choice benchmarks.

● health: Expect low reliability in high-stakes diagnostic tasks despite high scores on general medical benchmarks.
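
To make the gap between overall and critical-criteria pass rates concrete, here is a minimal Python sketch of weighted rubric scoring. It is not the paper's method: the weights, the pass/fail outcomes, and the function names are all hypothetical, chosen only to show how a respectable unweighted score can hide a low pass rate on the highest-weighted criteria.

```python
from collections import defaultdict

# Hypothetical rubric results as (weight, passed) pairs.
# Weight 1 = low-stakes criterion, weight 5 = critical criterion.
# These values are invented for illustration, not taken from the paper.
results = [
    (1, True), (1, True), (1, True), (1, True), (1, False),
    (5, True), (5, False), (5, False),
]


def unweighted_score(items):
    """Fraction of criteria passed, ignoring weights."""
    return sum(ok for _, ok in items) / len(items)


def weighted_score(items):
    """Fraction of total rubric weight earned."""
    return sum(w * ok for w, ok in items) / sum(w for w, _ in items)


def pass_rate_by_weight(items):
    """Pass rate computed separately for each weight tier."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for weight, ok in items:
        total[weight] += 1
        passed[weight] += int(ok)
    return {w: passed[w] / total[w] for w in total}


if __name__ == "__main__":
    print("unweighted:", unweighted_score(results))
    print("weighted:", weighted_score(results))
    print("by tier:", pass_rate_by_weight(results))
```

In this toy data the unweighted score is higher than the weighted one because most of the failures fall on the weight-5 criteria, which is the pattern a per-tier breakdown is meant to expose.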

