[arXiv] score: 0.24

Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

May 7, 2026
Researchers evaluated five multimodal LLMs (MLLMs), namely GPT-4.1, InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, and MedGemma-4B-Instruct, on 5,811 real-world dermatology cases comprising 46,405 images, across differential diagnosis and severity triage tasks. GPT-4.1 achieved a top-3 diagnostic accuracy of 42.25% on public benchmarks, but performance dropped substantially on hospital consultation cohorts. Clinicians, health AI developers, and regulators deploying vision-language models in dermatology workflows should treat benchmark scores as unreliable proxies for clinical utility and require real-world validation before deployment.
cs.CV, cs.AI, cs.CY