[r/MachineLearning] score: 0.10
Open-source 9-task benchmark for coding-agent retrieval augmentation. Per-task deltas +0.010 to +0.320, all evals reproducible [P]
April 25, 2026
**What it is:** An open-source benchmark suite (`paper-lantern-challenges`) that evaluates whether giving a coding agent (Claude Opus 4.6 planner + Gemini Flash 3 task model) retrieval access to CS literature improves performance on 9 software tasks. Per-task deltas range from +0.010 to +0.320 on metrics such as mutation score, SQL execution accuracy, and classification accuracy.
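A minimal sketch of how per-task deltas like these could be computed from a baseline run and a retrieval-augmented run. This is not the released harness; the file names and the `{"task": ..., "score": ...}` record layout are assumptions for illustration.

```python
import json

def load_scores(path: str) -> dict:
    """Map task name -> score from a list of per-task result records."""
    with open(path) as f:
        return {rec["task"]: rec["score"] for rec in json.load(f)}

# Hypothetical result files: one run without retrieval, one with.
baseline = load_scores("results_no_retrieval.json")
augmented = load_scores("results_with_retrieval.json")

# Per-task delta: positive means retrieval over CS literature helped on that task.
for task in sorted(baseline):
    delta = augmented[task] - baseline[task]
    print(f"{task:30s} {baseline[task]:.3f} -> {augmented[task]:.3f}  delta={delta:+.3f}")
```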
**Why it matters:** The benchmark ships fully reproducible artifacts (prompts, agent code paths, prediction files) for a relatively underexplored question: whether RAG over academic literature meaningfully improves agentic coding pipelines. That gives practitioners a concrete baseline for evaluating similar retrieval-augmented agent designs. However, the author discloses a conflict of interest (the retrieval system under test is their own product), the single-pass evaluation with no retries limits statistical reliability, and the wide spread of deltas (+0.010 to +0.320) suggests task-specific sensitivity, so the results warrant independent replication before drawing general conclusions.
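To illustrate the single-pass reliability caveat, a hedged sketch of a paired bootstrap over item-level outcomes for one task, which would put a confidence interval around a per-task delta. The item arrays here are hypothetical stand-ins; the released prediction files would supply the real ones, and this is not part of the benchmark's evaluation code.

```python
import random

def bootstrap_delta_ci(base_items, aug_items, n_boot=10_000, seed=0):
    """95% CI for an accuracy delta via paired resampling of per-item outcomes."""
    rng = random.Random(seed)
    n = len(base_items)
    deltas = []
    for _ in range(n_boot):
        # Resample item indices once and apply to both runs (paired bootstrap).
        idx = [rng.randrange(n) for _ in range(n)]
        base_acc = sum(base_items[i] for i in idx) / n
        aug_acc = sum(aug_items[i] for i in idx) / n
        deltas.append(aug_acc - base_acc)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# Hypothetical 0/1 correctness per item for one task, baseline vs. retrieval run.
base = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
aug  = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
print(bootstrap_delta_ci(base, aug))  # e.g. a CI that may or may not exclude zero
```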
project