
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu

Published
May 27, 2026 — 16:39 UTC

Problem
This preprint asks whether large language model (LLM)-based search agents actually use external web resources to retrieve information, or merely verify what they already know. The authors call the latter behavior Intrinsic Knowledge Dependence (IKD): agents rely heavily on internal knowledge rather than external evidence. This raises concerns about the validity of static search benchmarks, which may conflate memory-based verification with effective information retrieval.

Method
The authors introduce LiveBrowseComp, a benchmark designed to evaluate search agents in a more dynamic setting. It consists of 335 human-authored questions whose answers depend on facts published within the 90 days before the benchmark was constructed. The questions are drawn from six updated information repositories and filtered to exclude globally salient events, so that answering them requires recent knowledge. The evaluation reports closed-book accuracy and search-augmented performance, and contrasts both with results on a traditional benchmark, BrowseComp. The study also runs a series of diagnostics that measure how much agents rely on intrinsic knowledge versus external evidence, revealing the extent of IKD.
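The recency filter and the two headline metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Question` fields, the construction date, the sample data, and the exact-match scoring rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Question:
    text: str
    answer: str
    fact_published: date  # hypothetical field: when the supporting fact appeared


def within_window(q: Question, built_on: date, window_days: int = 90) -> bool:
    """True if the fact was published in the window_days before construction."""
    earliest = built_on - timedelta(days=window_days)
    return earliest <= q.fact_published <= built_on


def accuracy(predictions: list[str], questions: list[Question]) -> float:
    """Case-insensitive exact match (an assumed scoring rule, not the paper's)."""
    if not questions:
        return 0.0
    hits = sum(
        p.strip().lower() == q.answer.strip().lower()
        for p, q in zip(predictions, questions)
    )
    return hits / len(questions)


# Made-up data: one question falls outside the 90-day window and is dropped.
built_on = date(2026, 5, 1)
pool = [
    Question("q1", "Alpha", date(2026, 4, 10)),
    Question("q2", "Beta", date(2025, 11, 2)),
    Question("q3", "Gamma", date(2026, 3, 1)),
]
recent = [q for q in pool if within_window(q, built_on)]

# The same agent scored twice: once without tools, once with search enabled.
closed_book = accuracy(["unknown", "unknown"], recent)
with_search = accuracy(["alpha", "delta"], recent)
gap = with_search - closed_book
```

Reporting the two scores side by side is what separates recall from retrieval: a high closed-book score means the benchmark can be answered from memory, whatever the search-augmented score says.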

Results
LLM-based search agents answer up to 44.5% of BrowseComp questions without using external tools, which indicates substantial reliance on intrinsic knowledge. In addition, more than 50% of the search queries these agents generate stem from internally produced hypotheses rather than from retrieved leads. When the supporting evidence is removed, agents perform worse than closed-book baselines, underscoring the limitations of static benchmarks. On LiveBrowseComp, all evaluated agents score below 2% closed-book accuracy, and their search-augmented scores fall 25 to 40 points relative to BrowseComp. The size of this drop suggests that prior model rankings are not reliable indicators of actual search capability.
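One way to picture the query-provenance finding is a crude token-overlap heuristic that flags a search query as hypothesis-driven when it introduces a term seen neither in the question nor in anything retrieved so far. This summary does not describe the diagnostic the authors actually use, so the function name, the stopword list, and the sample trace below are hypothetical.

```python
import re

# Tiny illustrative stopword list; a real analysis would need a fuller one.
STOPWORDS = {"the", "a", "an", "in", "of", "by", "was", "which", "who", "what"}


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def hypothesis_driven_share(question: str, steps: list[tuple[str, str]]) -> float:
    """Share of queries that introduce a term not yet seen in external text.

    steps holds (query, retrieved_text) pairs in the order they were issued.
    Only the question and retrieved text count as seen; terms the agent
    supplies itself never do.
    """
    if not steps:
        return 0.0
    seen = tokens(question)
    flagged = 0
    for query, retrieved in steps:
        if tokens(query) - seen:
            flagged += 1  # at least one term came from the model itself
        seen |= tokens(retrieved)
    return flagged / len(steps)


# Made-up trace: the third query names "Acme", which no source has mentioned.
question = "Which startup acquired the robotics lab in April?"
trace = [
    ("robotics lab acquired April startup", "Report: Nimbus Labs was bought by Orion Dynamics."),
    ("Orion Dynamics startup", "Orion Dynamics is a logistics startup."),
    ("Acme Robotics April", "No relevant results."),
]
share = hypothesis_driven_share(question, trace)
```

Exact token matching is deliberately simplistic: it treats "acquisition" and "acquired" as unrelated terms, so a practical version would need stemming or entity matching before its numbers meant much.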

Limitations
The authors acknowledge that their analysis is specific to the LiveBrowseComp benchmark and may not generalize to all search tasks or domains. The focus on recent information may also miss parts of an agent's ability to handle diverse queries. Because the questions are human-authored, they may carry biases from their writers' knowledge and question framing. Finally, the study does not explore the underlying causes of the observed IKD, which leaves open how LLMs decide between internal knowledge and external evidence during search tasks.

Why it matters
The work bears on how search agents are developed and evaluated, particularly on understanding their limits in real-world use. By exposing the gap between what agents know internally and what they actually retrieve, the authors argue for more dynamic benchmarks such as LiveBrowseComp, which better reflect the challenge of finding up-to-date information. They encourage future work on improving the retrieval capabilities of LLMs and on more robust evaluation frameworks that account for the complexities of information search and verification.

Authors: HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu
Source: arXiv:2605.28721
URL: https://arxiv.org/abs/2605.28721v1

Turing Wire

By Turing Wire editorial staff · May 27, 2026

Source: arXiv cs.AI