[arXiv] score: 0.15
TSM-Bench Shows SOTA Detectors Fail on Task-Specific LLM Edits in Wikipedia
June 1, 2026
TSM-Bench is a multilingual, multi-generator benchmark targeting real-world Wikipedia editing tasks such as summarization, where contextual constraints make LLM-generated text closely resemble human writing. State-of-the-art (SOTA) machine-generated text (MGT) detectors are significantly less accurate on these task-specific edits than on generic generation benchmarks.
cs.CL
HOW THIS AFFECTS YOU
● researcher: Existing MGT detection benchmarks overestimate real-world detector performance; TSM-Bench provides a harder, more ecologically valid evaluation target.
● policy: Detection tools deployed on user-generated content (UGC) platforms like Wikipedia are likely underperforming on the most common LLM-assisted editing patterns, with direct implications for content integrity governance.