
Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

May 7, 2026

Researchers evaluated five multimodal LLMs (GPT-4.1, InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, and MedGemma-4B-Instruct) on 5,811 real-world dermatology cases comprising 46,405 images, covering differential diagnosis and severity triage tasks. GPT-4.1 achieved a top-3 diagnostic accuracy of 42.25% on public benchmarks, but performance dropped substantially on hospital consultation cohorts. For clinicians, health AI developers, and regulators deploying vision-language models in dermatology workflows, the results indicate that benchmark scores are unreliable proxies for clinical utility and that real-world validation is needed before deployment.
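For readers unfamiliar with the metric, top-3 diagnostic accuracy is usually the fraction of cases in which the reference diagnosis appears among a model's three highest-ranked candidates. The sketch below illustrates that definition only. The function name, the case-insensitive exact string matching, and the toy cases are assumptions made for illustration; the summary does not say how the paper matched model outputs to reference diagnoses.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    labels: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of cases whose reference label is among the first k predictions.

    Assumption: labels match by case-insensitive exact string comparison.
    A real evaluation would likely need synonym or ontology-based matching.
    """
    if len(ranked_predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one case is required")

    hits = 0
    for preds, label in zip(ranked_predictions, labels):
        top_k = {p.strip().lower() for p in preds[:k]}
        if label.strip().lower() in top_k:
            hits += 1
    return hits / len(labels)


# Hypothetical toy cases, not data from the paper.
toy_predictions = [
    ["psoriasis", "eczema", "tinea corporis"],
    ["melanoma", "nevus", "seborrheic keratosis"],
    ["acne", "rosacea", "folliculitis"],
    ["eczema", "scabies", "urticaria"],
]
toy_labels = ["eczema", "basal cell carcinoma", "acne", "psoriasis"]

top3 = top_k_accuracy(toy_predictions, toy_labels, k=3)
top1 = top_k_accuracy(toy_predictions, toy_labels, k=1)
```

On these toy cases, two of the four reference diagnoses fall within the top three candidates and one is ranked first, so the top-3 score exceeds the top-1 score. A top-3 figure is therefore a more lenient measure than exact-match diagnosis.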

cs.CV, cs.AI, cs.CY

SOURCE

https://arxiv.org/abs/2605.04098
