Notable evaluation benchmarks

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

Pedro Conde, Henrique Branquinho, Valerio Mazzone, Bruno Mendes, André Baptista, Nuno Moniz

Published: May 11, 2026, 16:50 UTC
Summary length: 466 words
Relevance score: 70%

Problem
This paper addresses a gap in the evaluation of AI pentesting agents: existing benchmarks focus on predefined goals such as capture-the-flag or remote code execution, are confined to simplified environments, and do not reflect real-world pentesting, which requires open-ended exploration and strategic decision-making. The preprint proposes a new evaluation protocol for assessing pentesting agents in realistic settings.

Method
The authors introduce a novel evaluation protocol that shifts the focus from mere task completion to validated vulnerability discovery. The protocol incorporates several key components:

  • Structured Ground-Truth: A comprehensive dataset of vulnerabilities is used to provide a reliable reference for evaluation.
  • LLM-based Semantic Matching: Leveraging large language models to enhance the identification of vulnerabilities through semantic understanding.
  • Bipartite Resolution: A scoring mechanism that accounts for the inherent ambiguity in vulnerability findings, allowing a more nuanced assessment (see the matching sketch after this list).
  • Continuous Ground-Truth Maintenance: Ensures that the evaluation remains relevant as new vulnerabilities are discovered.
  • Stochastic Agent Evaluation: Facilitates repeated and cumulative assessments of agents, capturing their performance variability over time.
  • Efficiency Metrics: Measures of the resource efficiency of pentesting agents (see the run-aggregation sketch after this list).
  • Reduced-Suite Selection: A strategy for sustainable experimentation that minimizes the number of tests while maximizing information gain.
The authors provide expert-annotated ground truth and code to ensure reproducibility of their evaluation framework.
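To make the matching and scoring stages concrete, here is a minimal sketch of how validated findings could be resolved against structured ground truth. Everything in it is an assumption for illustration: the GroundTruthVuln/Finding schema, the resolve_matches name, the 0.5 acceptance threshold, and the use of SciPy's Hungarian-algorithm solver as the one-to-one "bipartite resolution" step are not taken from the paper, whose exact formulation this summary does not give.

```python
# Hypothetical sketch: resolve agent findings against expert ground truth
# via semantic scoring plus a one-to-one (bipartite) assignment.
from dataclasses import dataclass
from typing import Callable

from scipy.optimize import linear_sum_assignment  # Hungarian algorithm


@dataclass
class GroundTruthVuln:
    vuln_id: str
    description: str  # expert-written reference description


@dataclass
class Finding:
    finding_id: str
    report: str  # free-text finding produced by the agent


def resolve_matches(
    findings: list[Finding],
    ground_truth: list[GroundTruthVuln],
    judge: Callable[[str, str], float],  # semantic similarity in [0, 1]
    threshold: float = 0.5,
) -> list[tuple[str, str]]:
    """Pair each agent finding with at most one ground-truth vulnerability.

    Scoring every (finding, reference) pair and then taking a one-to-one
    assignment keeps a single vague finding from claiming credit for
    several distinct vulnerabilities.
    """
    if not findings or not ground_truth:
        return []
    scores = [[judge(f.report, g.description) for g in ground_truth]
              for f in findings]
    # linear_sum_assignment minimizes total cost, so negate the similarities.
    rows, cols = linear_sum_assignment([[-s for s in row] for row in scores])
    return [
        (findings[r].finding_id, ground_truth[c].vuln_id)
        for r, c in zip(rows, cols)
        if scores[r][c] >= threshold  # discard low-confidence pairings
    ]
```

The judge argument is supplied externally; one way to back it with an LLM is sketched next.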
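A provider-agnostic sketch of that LLM-based semantic judge, assuming only a generic complete(prompt) -> str client (any chat-completion API would do). The prompt wording, the numeric-reply convention, and the make_llm_judge name are illustrative assumptions, not the paper's prompts.

```python
# Hypothetical sketch: wrap an LLM call into a semantic-matching judge.
from typing import Callable


def make_llm_judge(complete: Callable[[str], str]) -> Callable[[str, str], float]:
    """Turn a generic text-completion client into a (finding, reference) -> score judge."""

    def judge(finding: str, reference: str) -> float:
        prompt = (
            "You compare penetration-testing findings.\n\n"
            f"Agent finding:\n{finding}\n\n"
            f"Reference vulnerability:\n{reference}\n\n"
            "Do these describe the same underlying vulnerability? "
            "Reply with a single number between 0 and 1."
        )
        reply = complete(prompt).strip()
        try:
            score = float(reply.split()[0])
        except (ValueError, IndexError):
            return 0.0  # an unparseable reply counts as no match
        return min(max(score, 0.0), 1.0)

    return judge
```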
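For the stochastic-evaluation and efficiency components, one simple aggregation (assumed here for illustration; the paper's exact metrics are not given in this summary) is cumulative recall over repeated runs of the same agent plus cost per validated finding:

```python
# Hypothetical sketch: aggregate repeated runs of one agent on one target.
def cumulative_recall(matched_per_run: list[set[str]], total_vulns: int) -> list[float]:
    """Fraction of ground-truth vulns found in at least one of the first k runs,
    for k = 1..n; captures how coverage grows as runs are repeated."""
    found: set[str] = set()
    curve = []
    for matched in matched_per_run:
        found |= matched
        curve.append(len(found) / total_vulns)
    return curve


def cost_per_validated_finding(total_cost_usd: float, validated: int) -> float:
    """Resource efficiency: spend divided by validated (matched) findings."""
    return float("inf") if validated == 0 else total_cost_usd / validated


# Example: three runs against a target with four ground-truth vulnerabilities.
runs = [{"V1", "V3"}, {"V1"}, {"V2", "V3"}]
print(cumulative_recall(runs, total_vulns=4))         # [0.5, 0.5, 0.75]
print(cost_per_validated_finding(12.0, validated=3))  # 4.0
```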

Results
The paper does not disclose specific numerical results or performance metrics against named baselines, as it primarily focuses on the methodological contribution of the evaluation protocol. However, the authors claim that their approach allows for a more operationally informative comparison of AI pentesting agents, suggesting that it can yield insights into the effectiveness of various agents in discovering vulnerabilities across complex targets.

Limitations
The authors acknowledge that their protocol, while comprehensive, may still face challenges in fully capturing the dynamic nature of real-world environments. They do not address potential biases in the ground-truth dataset or the limitations of LLMs in accurately matching vulnerabilities. Additionally, the scalability of the protocol to a broader range of attack surfaces and the computational overhead associated with continuous ground-truth maintenance are not discussed in detail.

Why it matters
This work has significant implications for the field of offensive security and AI-driven pentesting. By providing a more realistic evaluation framework, it enables researchers and practitioners to better assess the capabilities of AI agents in identifying vulnerabilities in complex systems. This could lead to improved development of pentesting tools and methodologies, ultimately enhancing the security posture of organizations. The release of the evaluation protocol and ground truth data also promotes reproducibility and further research in this domain.

Authors: Pedro Conde, Henrique Branquinho, Valerio Mazzone, Bruno Mendes, André Baptista, Nuno Moniz
Source: arXiv:2605.10834
URL: https://arxiv.org/abs/2605.10834v1

Turing Wire
Author: Turing Wire editorial staff