[HN]score: 0.23
Benchmarking AGENTS.md Iteratively: Best Version Still Regressed on Holdout
May 27, 2026
Iterating AGENTS.md with Codex over 8 runs using a repo-specific benchmark improved training-slice performance but the best candidate still regressed on a clean holdout set, showing that agent instruction files need rigorous eval like any tunable system component.
HOW THIS AFFECTS YOU
●
builderYou should treat AGENTS.md/CLAUDE.md as a tunable runtime artifact with holdout evals, not a static doc — this post shows a concrete methodology for doing that.
●
researcherThe overfitting behavior on a small benchmark slice highlights a real evaluation challenge for instruction-following optimization in coding agents.