[arXiv] score: 0.51
Microsoft research: frontier models corrupt ~25% of document content in multi-step workflows
May 14, 2026
Microsoft benchmarked frontier LLMs on autonomous multi-step workflows across 52 professional domains and found that models corrupt ~25% of document content on average by the time a workflow completes. The errors appear as subtle rewrites rather than deletions, so they evade standard validation. This is a serious reliability signal for anyone deploying LLM agents in document-heavy enterprise pipelines.
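To illustrate why a rewrite can slip past a deletion-oriented check, here is a minimal sketch in Python. It is not the benchmark's methodology: the function names `naive_validation` and `changed_line_fraction`, the sample document, and the choice of a line-level diff are all assumptions made for illustration. The idea is to compare a check that only looks for missing content against one that compares the agent's output line by line with the original.

```python
import difflib


def naive_validation(original: str, edited: str) -> bool:
    """Hypothetical deletion-oriented check: same line count, no blank lines."""
    orig_lines = original.splitlines()
    edit_lines = edited.splitlines()
    same_length = len(orig_lines) == len(edit_lines)
    nothing_blank = all(line.strip() for line in edit_lines)
    return same_length and nothing_blank


def changed_line_fraction(original: str, edited: str) -> float:
    """Fraction of original lines that do not survive verbatim in the edit."""
    orig_lines = original.splitlines()
    edit_lines = edited.splitlines()
    if not orig_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=orig_lines, b=edit_lines, autojunk=False)
    # Sum the sizes of all runs of lines that match exactly, in order.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return (len(orig_lines) - unchanged) / len(orig_lines)


# Illustrative sample data, not taken from the benchmark.
ORIGINAL = "\n".join([
    "Invoice total: 1,250.00 USD",
    "Payment due within 30 days",
    "Late fee: 2% per month",
    "Contact: accounts@example.com",
    "Reference: PO-1001",
])

# Hypothetical agent output: one line subtly rewritten, nothing deleted.
EDITED = "\n".join([
    "Invoice total: 1,250.00 USD",
    "Payment due within 60 days",
    "Late fee: 2% per month",
    "Contact: accounts@example.com",
    "Reference: PO-1001",
])

if __name__ == "__main__":
    print("naive check passes:", naive_validation(ORIGINAL, EDITED))
    print("fraction of lines changed:", changed_line_fraction(ORIGINAL, EDITED))
```

In this toy case the naive check passes because no line was dropped, while the diff reports that one of the five original lines no longer matches. A real pipeline would likely need something more forgiving than exact line matching, since legitimate edits also change lines.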