[arXiv] score: 0.14
Deep Research Pipeline Raises Literature Search Recall from 20% to 80%
May 29, 2026
On the 250-paper RollingEval-Jun25 benchmark, a bibliography-expanding Deep Research pipeline raises recall from under 20% with vanilla API search to over 80%. An LLM-as-judge evaluation rates only 51% of human citations as moderately relevant or higher, versus 86–88% for the top AI re-rankers, and finds that humans are 2.5x more likely to cite direct collaborators.
cs.AI, cs.IR
HOW THIS AFFECTS YOU
● Builder: The breadth-first bibliography expansion technique can be implemented directly to improve recall in any RAG pipeline over academic or document corpora.
● Researcher: Human citation lists are a biased evaluation target because of collaboration-network effects; RollingEval-Jun25 and LLM-as-judge evaluation offer a more neutral benchmark for literature retrieval.
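
For builders, the breadth-first bibliography expansion idea can be sketched roughly as follows. This is a minimal illustration and not the paper's pipeline: the summary gives no implementation details, so the function names (`expand_bibliography`, `recall`), the `fetch_references` callback, and the `max_depth` and `max_papers` limits are all assumptions made for this example. The sketch starts from seed papers returned by an initial search, follows their reference lists level by level, and measures recall against a known set of relevant papers.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Set


def expand_bibliography(
    seeds: Iterable[Hashable],
    fetch_references: Callable[[Hashable], Iterable[Hashable]],
    max_depth: int = 2,
    max_papers: int = 500,
) -> Set[Hashable]:
    """Breadth-first expansion of a seed set through reference lists.

    `fetch_references` is a hypothetical callback that returns the IDs
    cited by a given paper (for example, a wrapper around a search API).
    """
    unique_seeds = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    seen = set(unique_seeds)
    queue = deque((paper_id, 0) for paper_id in unique_seeds)

    while queue and len(seen) < max_papers:
        paper_id, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand papers at the depth limit
        for ref_id in fetch_references(paper_id):
            if ref_id not in seen:
                seen.add(ref_id)
                queue.append((ref_id, depth + 1))
                if len(seen) >= max_papers:
                    break  # stop once the candidate budget is spent
    return seen


def recall(retrieved: Iterable[Hashable], relevant: Iterable[Hashable]) -> float:
    """Fraction of relevant papers that appear in the retrieved set."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(relevant_set & set(retrieved)) / len(relevant_set)


# Toy citation graph standing in for a real API (assumed data).
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"]}
candidates = expand_bibliography(["A"], lambda p: toy_graph.get(p, []), max_depth=2)
print(sorted(candidates), recall(candidates, {"B", "D", "E"}))
```

In a real pipeline the expanded candidate set would typically be passed to a re-ranker before use, since expansion raises recall at the cost of precision.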