[arXiv]score: 0.15
Toxic Prompt Phrasing Degrades LLM Factual Accuracy Across ARC, GSM8K, MMLU
June 1, 2026
Across five LLMs evaluated on ARC-Easy, GSM8K, and MMLU, toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing shows limited effect. Attribution-graph analysis shows toxicity amplifies perturbation-sensitive internal nodes, suggesting the degradation has a mechanistic basis rather than being surface-level noise.
cs.CLcs.AIcs.CYcs.HC
HOW THIS AFFECTS YOU
●
builderProduction systems handling adversarial or emotionally charged user input should account for measurable accuracy drops — input sanitization may have functional, not just safety, value.
●
researcherAttribution-graph circuit analysis linking prompt toxicity to specific internal activation patterns is a concrete mechanistic finding worth examining.
●
policyEmpirical evidence that toxic phrasing degrades factual reliability strengthens the case for input moderation requirements in high-stakes deployments.