Major evaluation benchmarks

UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

Published: Jul 3, 2026 — 16:14 UTC

The UK’s AI Security Institute (AISI) analyzed seven standard benchmarks used to evaluate AI agents and found that they systematically underestimate the true capabilities of AI systems. According to the research, capping the compute budget available to an agent significantly understates its performance, particularly on software engineering tasks.

AISI found that increasing the token budget tenfold improved AI agents' success rates by approximately 25%. According to the study, newer models, which are designed to leverage larger compute resources, benefit the most from larger token budgets. It suggests that actual progress at the frontier of AI capabilities is about 60% steeper than previously measured, and that evaluation metrics need to be revised to account for the full potential of AI agents. For further details, refer to the original article on The Decoder.

By Callan Zhang · Jul 3, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: The Decoder