GPT-5.5 Hits 50% CVE Fix Rate; No Frontier Model Reliably Patches Vulnerabilities
May 29, 2026
Across 20 real CVEs with three prompt types, GPT-5.5 achieved the highest solve rate at 50% overall and 60% under full advisory conditions; Poolside models scored lower with statistically significant cross-family differences under McNemar testing. Failure modes include wrong-search drift, budget exhaustion, and partial fixes, indicating current agents cannot be trusted for autonomous vulnerability remediation.
HOW THIS AFFECTS YOU
● Builder: A 50% ceiling on CVE patching under ideal conditions means you cannot deploy current agents for autonomous security remediation without human review on every fix.
● Researcher: The benchmark methodology — three prompt types, McNemar significance testing, and post-hoc correction for false negatives — is a reusable eval design for agentic security tasks.
● Policy: Even the best frontier model fails half of real vulnerability patches, which is directly relevant to any regulatory framework considering AI in critical infrastructure security workflows.