[arXiv]

LLM Verbalized Confidence Is Near-Constant and Epistemically Vacuous on Clinical Data

June 19, 2026

In a comparison of Qwen 2.5 7B against XGBoost on clinical tabular prediction, the LLM's self-reported confidence stays in a 0.856–0.937 band whether its actual accuracy is 49% or 75.3%, and it tracks prompt format rather than correctness. The LLM also shows an inverse difficulty effect: its accuracy drops to 64.8% on cases where XGBoost is 99% confident.
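As a rough illustration of the kind of check behind these numbers, the sketch below compares stated confidence with observed correctness. It is not the paper's code. The function names, the equal-width binning, and the 0.99 threshold argument are assumptions made for illustration, and the inputs are expected to be per-case lists that you supply yourself.

```python
from statistics import mean


def confidence_gap(confidences, correct):
    """Mean stated confidence on correct cases minus mean on incorrect cases.

    A value near zero means confidence does not separate right from wrong answers.
    """
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect case")
    return mean(right) - mean(wrong)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        acc = mean([1.0 if ok else 0.0 for _, ok in members])
        conf = mean([c for c, _ in members])
        ece += len(members) / total * abs(acc - conf)
    return ece


def accuracy_where_reference_confident(llm_correct, ref_confidence, threshold=0.99):
    """LLM accuracy restricted to cases where a reference model is very confident.

    Hypothetical helper for probing an inverse difficulty effect; the reference
    model could be a tabular baseline such as XGBoost.
    """
    subset = [ok for ok, rc in zip(llm_correct, ref_confidence) if rc >= threshold]
    if not subset:
        raise ValueError("no cases at or above the threshold")
    return sum(subset) / len(subset)
```

A confidence gap near zero together with a large calibration error would match the pattern described above, where confidence stays high and flat while accuracy varies.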

HOW THIS AFFECTS YOU

● researcher: Cross-model attribution divergence is a practical diagnostic for identifying where LLMs fail on structured tasks without relying on self-reported confidence.

● policy: Quantified evidence that LLM confidence scores are non-informative on structured clinical tasks strengthens the case for external calibration requirements in high-stakes deployments.

● health: Worth watching. Confidence-based safeguards for LLMs deployed on clinical tabular data are unreliable, because verbalized confidence does not track prediction quality.
