The Generalized Turing Test: A Foundation for Comparing Intelligence
Daniel Mitropolsky, Susan S. Hong, Riccardo Neumarker, Emanuele Rimoldi, Tomaso Poggio
- Published: May 11, 2026 — 17:00 UTC
- Summary length: 440 words
- Relevance score: 70%
Problem
This paper presents the Generalized Turing Test (GTT), which addresses a gap in the literature: the lack of a formal way to compare intelligence across arbitrary agents. Existing evaluation frameworks typically rely on specific datasets or benchmarks, which limits their applicability. The GTT instead aims to provide a dataset- and task-agnostic method for evaluating the indistinguishability of agents, which is particularly relevant for rapidly evolving AI systems. This work is a preprint and has not yet undergone peer review.
Method
The core technical contribution is the formal definition of the Turing comparator A ≥ B, which asserts that agent B cannot reliably distinguish interactions with agent A (instructed to imitate B) from interactions with another instance of B. The authors analyze the structural properties of this comparator, including conditions under which it is transitive, allowing an ordering to be established over equivalence classes of intelligence. They also define variants of the GTT that incorporate querying, bounded interaction, and fixed distinguishers. The empirical evaluation runs thousands of trials across modern models, assessing pairwise indistinguishability to validate the theoretical framework.
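The summary does not give the paper's actual protocol, but the comparator's logic can be illustrated with a minimal sketch: a distinguisher (playing the role of B) converses with a hidden partner that is either A imitating B or a fresh instance of B, and its accuracy over many trials is compared to chance. The agent interface (`ask`, `reply`, `guess`), the round count, and the decision margin below are all assumptions for illustration, not the authors' design.

```python
import random

def gtt_trial(agent_a, agent_b, distinguisher, n_rounds=5, rng=random):
    """One hypothetical GTT trial: the distinguisher probes a hidden
    partner (A imitating B, or B itself) and guesses which it faced."""
    # Hidden coin flip: 0 -> partner is A-imitating-B, 1 -> partner is B.
    label = rng.randint(0, 1)
    partner = agent_a if label == 0 else agent_b
    history = []
    for _ in range(n_rounds):
        query = distinguisher.ask(history)         # distinguisher's probe
        reply = partner.reply(history + [query])   # hidden partner answers
        history += [query, reply]
    return distinguisher.guess(history) == label   # did B tell them apart?

def indistinguishable(agent_a, agent_b, distinguisher,
                      n_trials=1000, margin=0.05, rng=random):
    """Estimate the distinguisher's accuracy; accuracy near chance (0.5)
    is evidence for A >= B in the comparator's sense."""
    wins = sum(gtt_trial(agent_a, agent_b, distinguisher, rng=rng)
               for _ in range(n_trials))
    return wins / n_trials <= 0.5 + margin
```

In this reading, "thousands of trials" amounts to estimating a per-pair distinguishing accuracy and declaring A ≥ B whenever that accuracy stays statistically close to 0.5.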
Results
The empirical results indicate a stratified structure in the comparisons, aligning with existing rankings of intelligence among the evaluated models. While specific numerical results are not disclosed in the abstract, the authors report that the GTT framework yields meaningful empirical orderings reflecting the relative capabilities of the agents tested, indicating that the GTT can serve as a robust tool for evaluating intelligence across diverse AI systems, independent of traditional benchmarks.
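The summary does not say how the stratified structure is computed, but one simple way to turn pairwise comparator outcomes into an ordering is to group mutually indistinguishable agents into equivalence classes and rank the classes by how many outside agents they dominate. The sketch below is an assumption-laden toy, not the paper's method; it also assumes the comparator data is transitive, which the Limitations section notes may not hold in general.

```python
def strata(agents, geq):
    """Group agents into equivalence classes (mutual a >= b and b >= a),
    then rank classes by how many agents they strictly dominate.
    `geq[(a, b)]` is True when a passed the (hypothetical) GTT against b."""
    classes = []
    for a in agents:
        for cls in classes:
            rep = cls[0]
            if geq[(a, rep)] and geq[(rep, a)]:  # mutually indistinguishable
                cls.append(a)
                break
        else:
            classes.append([a])  # a starts a new equivalence class
    def dominated(cls):
        # Count agents the class representative strictly dominates.
        rep = cls[0]
        return sum(geq[(rep, b)] and not geq[(b, rep)] for b in agents)
    return sorted(classes, key=dominated, reverse=True)
```

On toy data where a "big" model dominates a "mid" one, which dominates a "small" one, this recovers the expected three strata; ties in the comparator collapse into a single class.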
Limitations
The authors acknowledge that the GTT’s reliance on indistinguishability may not capture all dimensions of intelligence, particularly those that are not easily quantifiable through interaction. Additionally, the framework’s transitivity conditions may not hold universally, potentially complicating the establishment of a comprehensive ordering. The paper does not address the computational complexity of implementing the GTT across a wide range of agents or the potential biases introduced by the choice of tasks and interactions used in the evaluation.
Why it matters
The introduction of the GTT has significant implications for the field of AI evaluation and benchmarking. By providing a formalized, task-agnostic framework for comparing intelligence, the GTT could facilitate more meaningful assessments of AI capabilities, guiding both research and development. Furthermore, the emphasis on indistinguishability as a unifying concept may inspire new training objectives that are less dependent on fixed datasets, potentially leading to more generalizable AI systems. This work lays the groundwork for future research into the nature of intelligence and its measurement, which is critical as AI continues to advance.
Source: arXiv:2605.10851
URL: https://arxiv.org/abs/2605.10851v1