[HN] score: 0.23

Benchmarking AGENTS.md Iteratively: Best Version Still Regressed on Holdout

May 27, 2026

Over 8 runs, Codex iterated on an AGENTS.md file against a repo-specific benchmark. Training-slice performance improved, but the best candidate still regressed on a clean holdout set, which shows that agent instruction files need rigorous evaluation like any other tunable system component.

HOW THIS AFFECTS YOU

● builder: Treat AGENTS.md/CLAUDE.md as a tunable runtime artifact with holdout evals, not a static doc. This post shows a concrete methodology for doing that.

● researcher: The overfitting on a small benchmark slice highlights a real evaluation challenge for instruction-following optimization in coding agents.
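
A minimal sketch of the holdout check described above, not the linked post's actual harness: pick the candidate instruction file that scores best on the training slice, then compare it to the baseline on tasks it never saw during iteration. The `Candidate` type, the `select_and_check` helper, and all pass counts below are hypothetical and chosen only to illustrate the pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    train_pass: int    # tasks passed on the slice used while iterating
    holdout_pass: int  # tasks passed on tasks held out from iteration


def select_and_check(baseline, candidates, train_total, holdout_total):
    """Select the best candidate by training-slice score only,
    then report how it compares to the baseline on the holdout set."""
    best = max(candidates, key=lambda c: c.train_pass)
    train_delta = (best.train_pass - baseline.train_pass) / train_total
    holdout_delta = (best.holdout_pass - baseline.holdout_pass) / holdout_total
    regressed = holdout_delta < 0
    return best, train_delta, holdout_delta, regressed


# Hypothetical numbers for illustration only (10 tasks per slice).
baseline = Candidate("baseline", train_pass=6, holdout_pass=7)
candidates = [
    Candidate("v1", train_pass=7, holdout_pass=7),
    Candidate("v2", train_pass=9, holdout_pass=6),
]

best, train_delta, holdout_delta, regressed = select_and_check(
    baseline, candidates, train_total=10, holdout_total=10
)
print(f"best on train: {best.name}")
print(f"train delta: {train_delta:+.2f}, holdout delta: {holdout_delta:+.2f}")
print(f"regressed on holdout: {regressed}")
```

Selection uses only the training slice, so the holdout number stays an honest estimate. If candidates were also ranked by holdout score, the holdout set would become a second training slice.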

SOURCE

https://www.stet.sh/blog/how-i-used-codex-to-improve-its-own-agents-md
