[arXiv] score: 0.15

TSM-Bench Shows SOTA Detectors Fail on Task-Specific LLM Edits in Wikipedia

June 1, 2026

TSM-Bench is a multilingual, multi-generator benchmark targeting real-world Wikipedia editing tasks such as summarization, where contextual constraints make LLM-generated text closely resemble human writing. State-of-the-art (SOTA) machine-generated text (MGT) detectors are significantly less accurate on these task-specific edits than on generic generation benchmarks.

cs.CL

HOW THIS AFFECTS YOU

● researcher: Existing MGT detection benchmarks overestimate real-world detector performance. TSM-Bench provides a harder, more ecologically valid evaluation target.

● policy: Detection tools deployed on user-generated content (UGC) platforms like Wikipedia are likely underperforming on the most common LLM-assisted editing patterns, which has direct implications for content integrity governance.

SOURCE

https://arxiv.org/abs/2605.31113
