[arXiv] score: 0.82
Reasoning Degrades GPT-5.4 on Clinical SOAP Notes; Non-Reasoning Config Wins Overall
May 26, 2026
Across the OMI Health, ACI-Bench, and PriMock57 datasets, enabling reasoning significantly degrades GPT-5.4 SOAP note quality, while DeepSeek-V4-Flash performs best among reasoning-enabled models; same-source RAG yields smaller, model-dependent gains.
cs.CL, cs.AI, cs.LG
HOW THIS AFFECTS YOU
● builder: For clinical documentation pipelines, GPT-5.4 with native reasoning disabled outperforms reasoning-enabled configurations, a counterintuitive deployment decision with direct cost and quality implications.
● researcher: A controlled 2x2 evaluation isolating reasoning and RAG effects reveals that reasoning capability does not transfer to structured clinical documentation tasks, challenging assumptions about reasoning model generalization.
● health: Non-reasoning GPT-5.4 achieves the highest SOAP note quality across three clinical datasets, providing concrete model selection guidance for EHR documentation automation.