EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge
June 10, 2026
EvoBrowseComp is an 800-question benchmark (400 English, 400 Chinese) for evaluating search agents. Its questions are synthesized from live web data to prevent the contamination and parametric memorization that inflate scores on static benchmarks such as BrowseComp. A three-agent pipeline handles QA synthesis, verification, and filtering so that questions require genuine retrieval rather than fact recall. The evolving design keeps the benchmark current as web content changes.
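
The verification and filtering stages can be pictured with a minimal sketch. This is not the benchmark's actual implementation: the `Candidate` fields, the substring-based `verify` check, and the `closed_book` callable (a stand-in for a model answering without search) are all assumptions made for illustration. The idea shown is only that a question is kept when its answer is supported by the source page and a closed-book model cannot already recall it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one synthesized question."""
    question: str
    answer: str
    evidence: str  # text of the live web page the question was built from
    language: str  # "en" or "zh"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons are lenient."""
    return " ".join(text.lower().split())


def verify(c: Candidate) -> bool:
    """Assumed verification rule: the gold answer must appear in the evidence."""
    return normalize(c.answer) in normalize(c.evidence)


def needs_retrieval(c: Candidate, closed_book: Callable[[str], str]) -> bool:
    """Assumed filter rule: drop questions a model answers from memory alone."""
    return normalize(closed_book(c.question)) != normalize(c.answer)


def run_pipeline(
    candidates: Iterable[Candidate], closed_book: Callable[[str], str]
) -> List[Candidate]:
    """Keep candidates that are both verified and not answerable closed-book."""
    return [c for c in candidates if verify(c) and needs_retrieval(c, closed_book)]


# Toy data standing in for the output of the synthesis stage.
candidates = [
    Candidate("Who won X in 2026?", "Alice", "Alice won X in 2026.", "en"),
    Candidate("Capital of France?", "Paris", "Paris is the capital of France.", "en"),
    Candidate("Who leads Y?", "Bob", "Carol leads Y.", "en"),
]


def closed_book(question: str) -> str:
    """Stub model: it only 'remembers' the static fact about France."""
    return "Paris" if "France" in question else "unknown"


kept = run_pipeline(candidates, closed_book)
```

In this toy run the France question is dropped because the stub model recalls it without search, and the third question is dropped because its answer is not supported by the evidence. A real pipeline would presumably use agent calls rather than string matching at each stage.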