[HUGGINGFACE]

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

June 10, 2026

EvoBrowseComp is an 800-question benchmark (400 English, 400 Chinese) for evaluating search agents. Its questions are synthesized from live web data, which guards against the contamination and parametric memorization that inflate scores on static benchmarks such as BrowseComp. A three-agent pipeline handles QA synthesis, verification, and filtering so that questions require genuine retrieval rather than fact recall. Because the benchmark evolves, it stays current as web content changes.
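The three stages can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the benchmark's actual implementation: `QAPair`, `run_pipeline`, and the `synthesize`, `verify`, and `closed_book_answer` callables are hypothetical names, and the rule "drop any question a model answers correctly without searching" is one plausible reading of filtering for genuine retrieval.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    source_url: str  # live page the question was synthesized from


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so answers compare loosely."""
    return " ".join(text.lower().split())


def run_pipeline(
    pages: Iterable[str],
    synthesize: Callable[[str], List[QAPair]],
    verify: Callable[[QAPair], bool],
    closed_book_answer: Callable[[str], str],
) -> List[QAPair]:
    """Hypothetical synthesis -> verification -> filtering loop."""
    kept: List[QAPair] = []
    for url in pages:
        # Stage 1 (assumed): a synthesis agent drafts QA pairs from a live page.
        for qa in synthesize(url):
            # Stage 2 (assumed): a verification agent checks the answer
            # against its source.
            if not verify(qa):
                continue
            # Stage 3 (assumed): discard questions a model already answers
            # from memory, since they would test recall rather than retrieval.
            guess = closed_book_answer(qa.question)
            if normalize(guess) == normalize(qa.answer):
                continue
            kept.append(qa)
    return kept
```

In practice each callable would wrap an LLM agent or search tool. Passing them in as plain functions keeps the sketch runnable with simple stubs.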
