Notable evaluation benchmarks

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu

Published: May 20, 2026 — 17:59 UTC

Problem
This paper addresses the inadequacy of existing benchmarks for evaluating the capabilities of frontier language models in deep research tasks, which involve extensive evidence collection, cross-source reconciliation, and long-horizon reasoning. The authors argue that current benchmarks fail to differentiate the performance of advanced models effectively, as they do not capture the complexity of real-world deep research scenarios. This work is presented as a preprint and has not undergone peer review.

Method
The authors introduce DeepWeb-Bench, a benchmark designed to challenge frontier language models through three primary dimensions of difficulty: massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. The benchmark is structured around four capability families: Retrieval, Derivation, Reasoning, and Calibration. Each task in the benchmark is accompanied by a source-provenance record with four levels of disclosure, facilitating the auditing of model scores against the underlying evidence. The evaluation was conducted on nine frontier models, although specific architectures and training compute details are not disclosed in the paper.

Results
The evaluation of DeepWeb-Bench yielded several key findings:

Retrieval failures accounted for only 12-14% of total errors, indicating that retrieval is not the primary bottleneck in model performance.
Strong models predominantly exhibited errors related to incomplete derivation, while weak models were more prone to hallucinated precision errors.
There is significant specialization among models across different domains, with a cross-model agreement correlation of only ρ = 0.61 and per-case disagreement reaching 18.8 percentage points. These results highlight the nuanced performance characteristics of models when faced with complex tasks.

Limitations
The authors acknowledge that the benchmark may not encompass all possible scenarios encountered in deep research tasks, potentially limiting its generalizability. Additionally, the reliance on frontier models may skew results, as the benchmark has not been tested on a broader range of architectures. The authors do not discuss the potential impact of dataset biases or the scalability of the benchmark to larger models or different domains.

Why it matters
DeepWeb-Bench represents a significant advancement in the evaluation of language models for deep research applications, providing a more rigorous framework for assessing their capabilities. By highlighting the limitations of current benchmarks and offering a structured approach to evaluate complex reasoning tasks, this work lays the groundwork for future research in model development and evaluation. The findings regarding model specialization and error types can inform the design of more effective training regimes and architectures, ultimately enhancing the performance of language models in real-world applications.

Authors: Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia et al.
Source: arXiv:2605.21482
URL: https://arxiv.org/abs/2605.21482v1

By Callan Zhang · May 20, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI