[r/MachineLearning] score: 0.10
Open-source 9-task benchmark for coding-agent retrieval augmentation. Per-task deltas +0.010 to +0.320, all evals reproducible [P]
April 25, 2026
**What it is:** An open-source benchmark suite (`paper-lantern-challenges`) that evaluates whether giving a coding agent (Claude Opus 4.6 planner + Gemini Flash 3 task model) retrieval access to CS literature improves performance on 9 software tasks. Per-task deltas range from +0.010 to +0.320 on metrics such as mutation score, SQL execution accuracy, and classification accuracy.
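A minimal sketch of how per-task deltas like these could be computed from a baseline run and a retrieval-augmented run. This is not the released harness; the file names and the `{"task": ..., "score": ...}` record layout are assumptions for illustration.

```python
import json

def load_scores(path: str) -> dict:
    """Map task name -> score from a list of per-task result records."""
    with open(path) as f:
        return {rec["task"]: rec["score"] for rec in json.load(f)}

# Hypothetical result files: one run without retrieval, one with.
baseline = load_scores("results_no_retrieval.json")
augmented = load_scores("results_with_retrieval.json")

# Per-task delta: positive means retrieval over CS literature helped on that task.
for task in sorted(baseline):
    delta = augmented[task] - baseline[task]
    print(f"{task:30s} {baseline[task]:.3f} -> {augmented[task]:.3f}  delta={delta:+.3f}")
```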
**Why it matters:** The benchmark ships fully reproducible artifacts (prompts, agent code paths, prediction files) for a relatively underexplored question: whether RAG over academic literature meaningfully improves agentic coding pipelines. That gives practitioners a concrete baseline for evaluating similar retrieval-augmented agent designs. However, the author discloses a conflict of interest (the retrieval system under test is their own product), the single-pass evaluation with no retries limits statistical reliability, and the wide spread of deltas (+0.010 to +0.320) suggests task-specific sensitivity, so the results warrant independent replication before drawing general conclusions.
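To illustrate the single-pass reliability caveat, a hedged sketch of a paired bootstrap over item-level outcomes for one task, which would put a confidence interval around a per-task delta. The item arrays here are hypothetical stand-ins; the released prediction files would supply the real ones, and this is not part of the benchmark's evaluation code.

```python
import random

def bootstrap_delta_ci(base_items, aug_items, n_boot=10_000, seed=0):
    """95% CI for an accuracy delta via paired resampling of per-item outcomes."""
    rng = random.Random(seed)
    n = len(base_items)
    deltas = []
    for _ in range(n_boot):
        # Resample item indices once and apply to both runs (paired bootstrap).
        idx = [rng.randrange(n) for _ in range(n)]
        base_acc = sum(base_items[i] for i in idx) / n
        aug_acc = sum(aug_items[i] for i in idx) / n
        deltas.append(aug_acc - base_acc)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# Hypothetical 0/1 correctness per item for one task, baseline vs. retrieval run.
base = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
aug  = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
print(bootstrap_delta_ci(base, aug))  # e.g. a CI that may or may not exclude zero
```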
project