AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images
Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao
Problem
This paper addresses the lack of a comprehensive benchmark for evaluating the forensic analysis of AI-generated academic images, a gap that is particularly pressing given the rapid advancement of generative models. Existing benchmarks neither capture the complexities specific to academic contexts nor provide a holistic evaluation framework. Note that AEGIS is presented as a preprint and has not yet undergone peer review.
Method
AEGIS introduces a structured benchmark that encompasses three primary innovations:
- Domain-Specific Complexity: The benchmark spans seven academic categories with 39 fine-grained subtypes, designed to expose the intrinsic difficulty of forensic analysis in academic imagery. Even strong models struggle: GPT-5.1 achieves only 48.80% overall accuracy, and expert models reach just 30.09% Intersection over Union (IoU) on localization.
- Diverse Forgery Simulations: AEGIS simulates four prevalent forgery strategies using 25 generative models. Forgeries produced by 11 of these generators are detected with average forensic accuracy below 50%, highlighting how far forensic capabilities lag behind generative advances.
- Multi-Dimensional Forensic Evaluation: The benchmark assesses detection, reasoning, and localization in a unified framework, revealing complementary strengths across model families: multimodal large language models (MLLMs) reach 84.74% accuracy in textual artifact recognition, while expert detectors peak at 79.54% in binary authenticity detection.
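To make the evaluation dimensions concrete, the two core quantitative metrics mentioned above (binary detection accuracy and localization IoU) can be sketched as below. This is an illustrative implementation, not the paper's released evaluation code; the function names and the tie-breaking choice for empty masks are assumptions.

```python
import numpy as np

def detection_accuracy(pred_labels, true_labels):
    """Fraction of images whose real/fake label is predicted correctly."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return float((pred == true).mean())

def mask_iou(pred_mask, true_mask):
    """Intersection over Union between binary forgery-localization masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    # Convention (assumed): two empty masks count as a perfect match.
    return float(inter / union) if union > 0 else 1.0

# Example: the detector flags the right two columns of a 4x4 image,
# but the ground-truth forged region covers the right three columns.
pred = np.zeros((4, 4), dtype=bool); pred[:, 2:] = True
true = np.zeros((4, 4), dtype=bool); true[:, 1:] = True
print(mask_iou(pred, true))  # 8 / 12 ≈ 0.667
```

A benchmark like AEGIS would average such per-image scores over each academic category and forgery strategy to produce the aggregate numbers quoted above.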
Results
AEGIS evaluates 25 leading MLLMs and 9 expert models, providing a diagnostic testbed that reveals fundamental limitations in academic image forensics. Even state-of-the-art models struggle with academic image forgery, with significant performance gaps across detection, reasoning, and localization tasks. The findings underscore that current forensic methods are not keeping pace with generative technologies.
Limitations
The authors acknowledge that AEGIS is still in its early stages and may not encompass all possible forgery techniques or academic contexts. They also note that the performance metrics are contingent on the specific models evaluated, which may not generalize across all potential forensic applications. Additionally, the benchmark does not address the interpretability of model decisions, which is crucial for forensic applications.
Why it matters
AEGIS has significant implications for image forensics, particularly in academic settings where the integrity of visual data is paramount. By providing a structured, comprehensive evaluation framework, it can guide the development of more robust forensic techniques that keep pace with generative models, and it encourages further work at the intersection of AI generation and forensic analysis, potentially improving methods for detecting and mitigating AI-generated content in academic and other critical domains.
Authors: Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao, Peilin Gao, Zijie Xi et al.
Source: arXiv:2604.28177
URL: https://arxiv.org/abs/2604.28177v1