[arXiv]

Frontier LLMs plateau at 90.8% on VerilogEval, blocked by unsolvable functional errors

June 19, 2026

A new error taxonomy for RTL code generation sorts failures into four types: syntactic, semantic, solvable functional, and unsolvable functional. Frontier models hit a hard ceiling at a 90.8% pass rate on VerilogEval, and the unsolvable functional errors that remain do not respond to test-time compute scaling. Optimizing models to eliminate syntax errors also worsens the deeper functional failures.

HOW THIS AFFECTS YOU

● Builder: If you are building LLM-assisted RTL or hardware design tools, this benchmark ceiling signals that current models cannot be reliably pushed past ~91% correctness without architectural changes.

● Researcher: The taxonomy and empirical ceiling provide a concrete framework for diagnosing where scaling and alignment techniques stop helping in hardware design tasks.
