[arXiv]score: 0.15

Toxic Prompt Phrasing Degrades LLM Factual Accuracy Across ARC, GSM8K, MMLU

June 1, 2026

Across five LLMs evaluated on ARC-Easy, GSM8K, and MMLU, toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing shows limited effect. Attribution-graph analysis shows toxicity amplifies perturbation-sensitive internal nodes, suggesting the degradation has a mechanistic basis rather than being surface-level noise.

cs.CLcs.AIcs.CYcs.HC

HOW THIS AFFECTS YOU

●

builderProduction systems handling adversarial or emotionally charged user input should account for measurable accuracy drops — input sanitization may have functional, not just safety, value.

●

researcherAttribution-graph circuit analysis linking prompt toxicity to specific internal activation patterns is a concrete mechanistic finding worth examining.

●

policyEmpirical evidence that toxic phrasing degrades factual reliability strengthens the case for input moderation requirements in high-stakes deployments.

SOURCE

https://arxiv.org/abs/2605.30913

← back to feed