Major evaluation benchmarks

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

Published: Jun 11, 2026 — 09:48 UTC

Problem
Existing benchmarks for evaluating search agents, such as BrowseComp, rely on static datasets, which can lead to test-set contamination and parametric memorization. This results in inflated performance metrics, as models may achieve high scores through fact recall rather than genuine retrieval capabilities. The authors identify the need for a benchmark that evolves with real-world knowledge and mitigates these issues, thus providing a more accurate assessment of a model’s browsing competence. This work is presented as a preprint, indicating it has not yet undergone peer review.

Method
The authors introduce EvoBrowseComp, a benchmark comprising 400 complex questions in English and 400 in Chinese, synthesized through a three-agent collaborative framework. The framework includes: (1) a QA synthesis agent that retrieves current knowledge from the live web to create question-answer pairs; (2) an information filtering agent that assesses the credibility and popularity of the retrieved information to prevent parametric shortcuts; and (3) a high-level guidance agent that structures questions into reasoning graphs, minimizing logical redundancy. This automated synthesis process allows for regular updates to the benchmark, ensuring it remains contamination-free and temporally relevant. The authors do not disclose specific training compute or architectural details for the agents involved.

Results
EvoBrowseComp demonstrates significant difficulty, necessitating broad horizontal search strategies for effective navigation. While specific performance metrics against baseline benchmarks are not detailed in the abstract, the authors emphasize that the evolving nature of the benchmark presents a scalable paradigm for high-difficulty evaluation. The benchmark’s design aims to challenge the capabilities of search agents, pushing them beyond simple fact recall to more complex reasoning and retrieval tasks.

Limitations
The authors acknowledge that the benchmark’s evolving nature may introduce variability in difficulty over time, which could complicate longitudinal comparisons of agent performance. Additionally, while the automated synthesis framework is designed to minimize contamination, the reliance on live web data may still expose the benchmark to biases inherent in the sources accessed. The paper does not address potential limitations related to the representativeness of the synthesized questions or the generalizability of results across different domains.

Why it matters
EvoBrowseComp represents a significant advancement in the evaluation of search agents, providing a framework that adapts to the evolving landscape of knowledge and agent capabilities. This dynamic benchmarking approach is crucial for developing robust AI systems that can effectively navigate and reason with real-time information. The implications of this work extend to various applications in AI, particularly in enhancing the reliability and accuracy of search agents in practical scenarios, as published in arXiv cs.CL.

By Callan Zhang · Jun 11, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL