Microsoft research: frontier models corrupt ~25% of document content

May 14, 2026

Microsoft benchmarked frontier LLMs on autonomous multi-step workflows across 52 professional domains and found that, on average, the models had corrupted about 25% of document content by the time a workflow completed. Critically, the errors appear as subtle rewrites rather than deletions, so they evade standard validation. This is a serious reliability signal for anyone deploying LLM agents in document-heavy enterprise pipelines.
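To make the "subtle rewrites rather than deletions" point concrete, here is a minimal sketch of why a structural check can pass while content has changed. It is not the benchmark's methodology: the `naive_check` and `rewritten_lines` helpers and the sample contract text are hypothetical, chosen only to illustrate the failure mode.

```python
import difflib


def naive_check(original: str, processed: str) -> bool:
    """Hypothetical structural validation: same line count, no empty lines."""
    orig_lines = original.splitlines()
    new_lines = processed.splitlines()
    return len(orig_lines) == len(new_lines) and all(
        line.strip() for line in new_lines
    )


def rewritten_lines(original: str, processed: str) -> list[tuple[str, str]]:
    """Return (before, after) pairs for spans that were rewritten in place."""
    orig_lines = original.splitlines()
    new_lines = processed.splitlines()
    matcher = difflib.SequenceMatcher(a=orig_lines, b=new_lines, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks content that changed without being dropped or added.
        if tag == "replace":
            pairs.append(
                ("\n".join(orig_lines[i1:i2]), "\n".join(new_lines[j1:j2]))
            )
    return pairs


# Illustrative input: one number silently changed, nothing deleted.
original = "Payment is due within 30 days.\nLate fees accrue at 2% per month."
processed = "Payment is due within 60 days.\nLate fees accrue at 2% per month."

print(naive_check(original, processed))      # structural check passes
print(rewritten_lines(original, processed))  # the diff surfaces the rewrite
```

A pipeline that only checks for missing or empty content would accept the processed document here; comparing each step's output against its input is one way to surface in-place rewrites for review.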

SOURCE

https://arxiv.org/abs/2604.15597
