[HN] score: 0.41
DeepSWE Benchmark: Contamination-Free Coding Agent Eval Across 91 Repos, 5 Languages
May 26, 2026
DeepSWE is a from-scratch software engineering benchmark covering 91 repos in 5 languages, with solutions that require 5.5x more code than SWE-bench Pro tasks. Audits of SWE-bench Pro found verifier error rates of 8% false positives and 24% false negatives.
HOW THIS AFFECTS YOU
● builder: DeepSWE can give you a more reliable signal on coding agent performance: models that appear close on public benchmarks separate into clearly ordered gaps that match real-world developer experience.
● researcher: SWE-bench Pro's 24% false negative verifier rate undermines its validity as a frontier benchmark; DeepSWE's hand-written behavioral verifiers and contamination-free task construction set a higher methodological bar.
● founder: Worth watching, because benchmark inflation on SWE-bench Pro means competitive claims from coding agent vendors may be overstated; DeepSWE could become the credibility standard for the category.