[arXiv] score: 0.17

Medical LLMs Including BioLlama3 Remain Brittle to Minor Prompt Perturbations

June 8, 2026

Systematic sensitivity analysis of GPT-3.5, Llama3, ClinicalBERT, BioLlama3, and BioBERT on MedMCQA shows that both lexical and syntactic prompt perturbations alter clinical advice, with adversarial prompts provoking harmful outputs across all tested models. Medical-specific fine-tuning does not confer meaningful robustness over general-purpose models.

HOW THIS AFFECTS YOU

● builder: You should implement prompt canonicalization or ensemble consistency checks in any healthcare LLM pipeline to reduce sensitivity to phrasing variation.

● policy: The study provides empirical grounding for requiring adversarial robustness testing as part of clinical AI validation and regulatory review.

● health: Medical-specific LLMs are not intrinsically safer than general-purpose ones; prompt sensitivity analysis should be a required step before any clinical deployment.
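For builders, the canonicalization and ensemble consistency idea can be sketched roughly as follows. This is a minimal illustration, not the paper's method: `ask_model` is a hypothetical callable wrapping whatever model client you use, the `min_agreement` threshold of 0.8 is an assumed default, and the paraphrased prompt variants are assumed to be produced elsewhere.

```python
import re
import unicodedata
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple


def canonicalize(prompt: str) -> str:
    """Normalize surface form only: Unicode NFKC, collapsed whitespace, trimmed ends."""
    text = unicodedata.normalize("NFKC", prompt)
    return re.sub(r"\s+", " ", text).strip()


def consistent_answer(
    variants: Iterable[str],
    ask_model: Callable[[str], str],  # hypothetical wrapper around your model client
    min_agreement: float = 0.8,       # assumed threshold; tune on your own data
) -> Tuple[Optional[str], float]:
    """Query the model on each distinct canonicalized variant and majority-vote.

    Returns (answer, agreement). The answer is None when agreement falls
    below min_agreement, signalling that the case should be escalated.
    """
    # Deduplicate after canonicalization while preserving order.
    prompts = list(dict.fromkeys(canonicalize(v) for v in variants))
    if not prompts:
        raise ValueError("need at least one prompt variant")
    # Normalize answers so "b" and "B " count as the same option label.
    answers = [ask_model(p).strip().upper() for p in prompts]
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return (top if agreement >= min_agreement else None), agreement
```

A `None` answer means the model disagreed with itself across phrasings, which is a reasonable trigger for human review rather than an automatic reply. Canonicalization here is deliberately conservative (no lowercasing or stemming), since more aggressive rewriting could change the meaning of units or drug names.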

read original ↗arxiv.org
