Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs
May 7, 2026
Agent-skills-eval is an open-source test runner for Anthropic's Agent Skills standard: it runs identical prompts with and without a SKILL.md context injection, then has a judge model score both outputs so you can measure whether a given skill actually improves results. The CLI works with any OpenAI-compatible endpoint, from hosted models such as GPT-4o-mini to Groq and local Llama servers, and emits JSON artifacts plus static HTML diff reports. ML engineers shipping domain-specific agent skills get a rigorous A/B harness in place of anecdotal validation, addressing the lack of a standardized evaluation layer in the Agent Skills ecosystem.
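A minimal sketch of the with/without comparison described above, assuming an OpenAI-compatible endpoint reached through the `openai` Python client. The names `run_pair` and `JUDGE_PROMPT`, the model choice, and the judge format are illustrative assumptions, not the project's actual code.

```python
# Hypothetical sketch of the A/B loop, not agent-skills-eval's real internals.
from openai import OpenAI

client = OpenAI()  # base_url / api_key taken from env; works with any
                   # OpenAI-compatible server (Groq, local Llama, etc.)

JUDGE_PROMPT = """You are grading two answers to the same task.
Task: {task}
Answer A: {a}
Answer B: {b}
Reply with JSON: {{"winner": "A" | "B" | "tie", "reason": "..."}}"""

def complete(messages, model="gpt-4o-mini"):
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def run_pair(task: str, skill_md: str, model: str = "gpt-4o-mini") -> dict:
    # Control arm: the bare prompt, no skill context.
    baseline = complete([{"role": "user", "content": task}], model)
    # Treatment arm: identical prompt with SKILL.md injected as system context.
    with_skill = complete(
        [{"role": "system", "content": skill_md},
         {"role": "user", "content": task}],
        model,
    )
    # Judge model scores the pair; a fuller harness would also blind and
    # shuffle A/B positions to avoid ordering bias.
    verdict = complete(
        [{"role": "user",
          "content": JUDGE_PROMPT.format(task=task, a=baseline, b=with_skill)}],
        model,
    )
    return {"baseline": baseline, "with_skill": with_skill, "verdict": verdict}

if __name__ == "__main__":
    skill = open("SKILL.md").read()
    print(run_pair("Summarize this quarter's churn drivers in three bullets.",
                   skill))
```

Looping `run_pair` over a prompt set and serializing the resulting dicts would yield the JSON artifacts the post mentions; presumably the real CLI layers batching, scoring aggregation, and the HTML diff reports on top of a loop like this.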