
Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen

Published: May 25, 2026 — 17:50 UTC

Problem
This paper addresses the limitations of current large language model (LLM) agents as always-on personal assistants, which typically operate over narrow slices of user context. Existing benchmarks fail to capture the complexity of user interactions and the broader digital environment, limiting the evaluation of context-sensitive reasoning and proactive assistance capabilities. The authors present Claw-Anything, a new benchmark designed to evaluate agents in a more comprehensive and realistic setting. This work is a preprint and has not yet undergone peer review.

Method
The core technical contribution is the Claw-Anything benchmark, which expands the evaluation context for LLM agents along three dimensions: (1) long-horizon activity histories, (2) interdependent backend services, and (3) integrated graphical user interface (GUI) and command-line interface (CLI) interactions across multiple devices. To build it, the authors simulate months of user activity through a multi-round event injection process, generating complex world states that include realistic noise such as irrelevant events and conflicting signals. Agents must therefore reason over a rich context while staying robust to that noise. The authors also provide an automated data-generation pipeline that produces 2,000 training environments; training on these environments reportedly improves the base model’s performance by 23.7%.
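The multi-round event injection idea can be pictured with a toy sketch. This is not the authors' pipeline: the names Event, WorldState and simulate_history, the three service names, the 30-day round length and the noise_ratio parameter are all assumptions made for illustration. It only shows how repeated rounds of injected events, some of them marked as distractors, could accumulate into a time-ordered activity history.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Event:
    day: int          # simulated day on which the event occurs
    service: str      # hypothetical backend service name
    text: str         # free-form description of the event
    relevant: bool    # False marks an injected distractor (noise)


@dataclass
class WorldState:
    events: list = field(default_factory=list)

    def inject(self, new_events):
        # Append one round of events and keep the history in time order.
        self.events.extend(new_events)
        self.events.sort(key=lambda e: e.day)


def simulate_history(rounds, events_per_round, noise_ratio, seed=0):
    """Build a toy activity history over several injection rounds."""
    rng = random.Random(seed)
    services = ["calendar", "email", "files"]  # illustrative only
    state = WorldState()
    for r in range(rounds):
        batch = []
        for i in range(events_per_round):
            is_noise = rng.random() < noise_ratio
            batch.append(Event(
                day=r * 30 + rng.randint(0, 29),  # assume one round per month
                service=rng.choice(services),
                text=f"round {r} event {i}",
                relevant=not is_noise,
            ))
        state.inject(batch)
    return state


history = simulate_history(rounds=3, events_per_round=5, noise_ratio=0.3)
print(len(history.events), "events,",
      sum(not e.relevant for e in history.events), "distractors")
```

In a sketch like this, an agent under test would see the full history without the relevant flag and would have to work out for itself which events matter to the task.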

Results
In experiments, the authors evaluate GPT-5.5 on the Claw-Anything benchmark and report a pass@1 rate of only 34.5%, meaning the model completes roughly one task in three on its first attempt. This is substantially lower than the scores observed on prior benchmarks, indicating a wide gap between current agent capabilities and the requirements for effective always-on personal assistance. The results highlight how hard it is for agents to anticipate user needs and deliver timely recommendations in a complex digital landscape.
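For readers unfamiliar with the metric, pass@1 with a single attempt per task is simply the fraction of tasks solved on that attempt. The sketch below assumes exactly that setup; the function name pass_at_1 and the toy results are hypothetical, and the paper may compute the metric differently (for example, by averaging over several sampled attempts).

```python
def pass_at_1(results):
    """Fraction of tasks whose single attempt passed.

    `results` maps a task id to True/False for the one sampled attempt.
    """
    if not results:
        raise ValueError("no task results given")
    return sum(results.values()) / len(results)


# Hypothetical outcomes for five tasks, not data from the paper.
toy_results = {"t1": True, "t2": False, "t3": False, "t4": True, "t5": False}
print(f"pass@1 = {pass_at_1(toy_results):.1%}")  # prints: pass@1 = 40.0%
```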

Limitations
The authors acknowledge several limitations. Simulated user activity may not fully capture the diversity of real-world interactions, and the benchmark may not cover all possible user scenarios, which could affect the generalizability of the results. Reliance on a single model (GPT-5.5) for evaluation may also limit insight into how other architectures perform. One limitation the authors do not discuss is the risk of overfitting to the training environments generated by their automated pipeline, which may not reflect the variability of real-world user behavior.

Why it matters
Claw-Anything gives developers of LLM-based personal assistants a more comprehensive evaluation framework. It encourages the design of models that operate effectively in complex, noisy environments and anticipate user needs, and it lays groundwork for future research on more capable, context-aware always-on assistants.

Authors: Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen, Shuzhe Wu, Feiyang Pan et al.
Source: arXiv:2605.26086
URL: https://arxiv.org/abs/2605.26086v1

By Callan Zhang · May 25, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI