[r/LocalLLaMA] score: 0.19
ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it)
May 5, 2026
ProgramBench is a newly released 200-task benchmark that evaluates LLM agents on rebuilding complete programs from an executable and its README alone: no decompilation, no internet access, and no constraint on implementation language. Agents must choose the language, design the abstractions, and architect the whole program themselves. Submissions are validated against 6M lines of behavioral black-box tests, which cost roughly $50K to generate. Top models fall well short of reliable reconstruction, exposing gaps in autonomous software engineering that single-project case studies had obscured. Practitioners building coding agents or evaluating LLM software capabilities should treat it as a serious stress test in place of anecdotal benchmarks.
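To make "behavioral black-box tests" concrete, here is a minimal sketch of the general idea: run the reference executable and the rebuilt program on the same inputs and compare only what is observable from outside. This is not ProgramBench's actual harness. The names `observe`, `pass_rate`, `REFERENCE`, and `CANDIDATE` are hypothetical, and comparing only stdout and exit code is an assumption made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible behavior of one run (assumed: stdout + exit code)."""
    stdout: bytes
    exit_code: int


def observe(cmd, args, stdin=b"", timeout=10.0):
    """Run cmd with args and stdin, recording only black-box observables."""
    proc = subprocess.run(
        [*cmd, *args],
        input=stdin,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return Observation(proc.stdout, proc.returncode)


def pass_rate(reference_cmd, candidate_cmd, cases):
    """Fraction of (args, stdin) cases where both programs behave identically."""
    if not cases:
        return 0.0
    passed = sum(
        observe(reference_cmd, args, stdin) == observe(candidate_cmd, args, stdin)
        for args, stdin in cases
    )
    return passed / len(cases)


# Hypothetical stand-ins: tiny Python one-liners play the role of the
# original binary and the agent's reconstruction.
REFERENCE = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
CANDIDATE = [
    sys.executable,
    "-c",
    "import sys; a = sys.argv[1]; print('wrong' if a == 'b' else a.upper())",
]

if __name__ == "__main__":
    cases = [(["a"], b""), (["b"], b"")]
    print(pass_rate(REFERENCE, CANDIDATE, cases))
```

Because nothing here inspects source code, the candidate can be written in any language, which fits the benchmark's "no language constraints" setup. A real harness would presumably also need to handle files, environment, and nondeterminism, none of which this sketch covers.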