
DeepSWE Benchmark: Contamination-Free Coding Agent Eval Across 91 Repos, 5 Languages

May 26, 2026

DeepSWE is a from-scratch software engineering benchmark covering 91 repos in 5 languages. Its solutions require 5.5x more code than SWE-bench Pro tasks. Audits of SWE-bench Pro found verifier error rates of 8% false positives and 24% false negatives.

HOW THIS AFFECTS YOU

● builder: You can use DeepSWE to get a more reliable signal on coding agent performance: models that appear close on public benchmarks separate into clearly ordered gaps that match real-world developer experience.

● researcher: SWE-bench Pro's 24% false negative verifier rate undermines its validity as a frontier benchmark; DeepSWE's hand-written behavioral verifiers and contamination-free task construction set a higher methodological bar.

● founder: Worth watching, because benchmark inflation on SWE-bench Pro means competitive claims from coding agent vendors may be overstated; DeepSWE could become the credibility standard for the category.

SOURCE

https://deepswe.datacurve.ai/blog
