[HN]score: 0.35

GPT-5.5 Hits 50% CVE Fix Rate; No Frontier Model Reliably Patches Vulnerabilities

May 29, 2026

Across 20 real CVEs with three prompt types, GPT-5.5 achieved the highest solve rate at 50% overall and 60% under full advisory conditions; Poolside models scored lower with statistically significant cross-family differences under McNemar testing. Failure modes include wrong-search drift, budget exhaustion, and partial fixes, indicating current agents cannot be trusted for autonomous vulnerability remediation.

HOW THIS AFFECTS YOU

●

builderA 50% ceiling on CVE patching under ideal conditions means you cannot deploy current agents for autonomous security remediation without human review on every fix.

●

researcherThe benchmark methodology — three prompt types, McNemar significance testing, and post-hoc correction for false negatives — is a reusable eval design for agentic security tasks.

●

policyEven the best frontier model fails half of real vulnerability patches, which is directly relevant to any regulatory framework considering AI in critical infrastructure security workflows.

SOURCE

https://giovannigatti.github.io/cve-bench/

← back to feed