[arXiv] score: 0.24

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

May 5, 2026
CLEAR (Clinical Evaluation of Ambiguity and Reliability), introduced in arXiv:2605.01011, stress-tests 17 LLMs across 3 medical benchmarks by systematically perturbing the number of answer options, the framing of the abstention option, and the semantic presentation. Key finding: accuracy degrades measurably as more plausible distractors are added, and abstention reliability drops sharply when the abstention option is framed as uncertainty rather than as an assertive rejection such as "None of the Above." This matters to medical AI developers and clinical NLP researchers deploying LLMs in diagnostic or triage pipelines, because standard MCQ benchmarks can substantially overestimate real-world robustness. CLEAR exposes structural blind spots that MedQA and similar static benchmarks cannot surface.
cs.CL, cs.AI, cs.LG
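
To make the perturbations in the summary above concrete, here is a minimal Python sketch of two of them: growing the option set with extra distractors, and swapping the correct answer for an abstention option whose label can be assertive or uncertainty-framed. This is not CLEAR's implementation. The names `MCQItem`, `add_distractors`, and `with_abstention`, the sample question, and the uncertainty label "I am not sure" are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MCQItem:
    """A multiple-choice item (hypothetical structure, not from the paper)."""
    question: str
    options: Tuple[str, ...]
    answer_index: int


def add_distractors(item: MCQItem, distractors: Sequence[str], k: int,
                    seed: int = 0) -> MCQItem:
    """Return a copy with the first k distractors added and options reshuffled.

    Assumes all option strings are distinct, so the correct answer can be
    located again by value after shuffling.
    """
    rng = random.Random(seed)
    correct = item.options[item.answer_index]
    options = list(item.options) + list(distractors[:k])
    rng.shuffle(options)
    return MCQItem(item.question, tuple(options), options.index(correct))


def with_abstention(item: MCQItem, label: str) -> MCQItem:
    """Remove the correct answer and append an abstention option.

    After this change the abstention option is the only right choice, so a
    model that still picks a remaining distractor has failed to abstain.
    """
    options = [o for i, o in enumerate(item.options) if i != item.answer_index]
    options.append(label)
    return MCQItem(item.question, tuple(options), len(options) - 1)


# Illustrative item (assumption: not drawn from any benchmark in the paper).
ITEM = MCQItem(
    question="Deficiency of which vitamin causes scurvy?",
    options=("Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"),
    answer_index=1,
)

# Same item under three perturbations: more distractors, assertive abstention,
# and uncertainty-framed abstention.
EXPANDED = add_distractors(ITEM, ["Vitamin B12", "Vitamin E"], k=2)
ASSERTIVE = with_abstention(ITEM, "None of the Above")
UNCERTAIN = with_abstention(ITEM, "I am not sure")
```

Comparing a model's accuracy on `ASSERTIVE` and `UNCERTAIN` variants of the same items would isolate the effect of abstention framing, since the remaining options are identical and only the label differs.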