Notable evaluation benchmarks

Multi-axis Analysis of Image Manipulation Localization

arXiv:2605.20174 Papers with Code →

Keanu Nichols, Divya Appapogu, Giscard Biamby, Dina Bashkirova, Anna Rohrbach, Bryan A. Plummer

Published: May 19, 2026 — 17:54 UTC

Problem
This paper addresses the gap in the literature regarding the detection of advanced image manipulations, particularly in the context of diverse visual domains. The authors highlight the increasing accessibility of sophisticated image editing tools and generative AI, which have made it easier to create convincing manipulations that can propagate misinformation. Despite the urgency of this issue, existing research lacks comprehensive benchmarks that evaluate the robustness of image manipulation detection methods across various manipulation types and domain shifts. The work introduces a new benchmark, AUDITS (Analysis Under Domain-shifts, qualIty, Type, and Size), to facilitate this research. This is a preprint and has not yet undergone peer review.

Method
The core technical contribution is the AUDITS benchmark, which consists of over 530,000 images sourced from user-generated content and news photos. The dataset is meticulously curated to encompass a wide array of manipulation types, sizes, and qualities, including recent diffusion-based inpaintings. The authors conduct experiments to assess the performance of existing image manipulation detection methods under various domain shifts, which include changes in image quality, manipulation type, and size. The evaluation framework is designed to provide insights into the robustness and generalizability of these detection methods, thereby informing future research directions.

Results
The authors report that existing state-of-the-art image manipulation detection methods exhibit significant performance degradation when subjected to domain shifts present in the AUDITS benchmark. For instance, a leading detection model achieves an F1 score of 0.75 on the original dataset but drops to 0.55 when tested on images with substantial domain shifts. This indicates a performance decline of approximately 27%, underscoring the challenges posed by diverse manipulation types and varying image qualities. The results highlight the necessity for more robust detection algorithms that can generalize across different visual contexts.

Limitations
The authors acknowledge several limitations in their work. Firstly, while AUDITS covers a broad spectrum of manipulation types, it may not encompass all possible variations, particularly emerging techniques in image manipulation. Secondly, the dataset’s reliance on user-generated and news photos may introduce biases that do not reflect all real-world scenarios. Additionally, the authors do not address the computational resources required for training and evaluating the models on this extensive dataset, which could be a barrier for some researchers. Lastly, the benchmark does not include a comprehensive evaluation of the interpretability of the detection methods.

Why it matters
This work is significant as it lays the groundwork for future research in image manipulation detection by providing a robust benchmark that accounts for various manipulation types and domain shifts. The insights gained from the AUDITS benchmark can drive the development of more reliable and generalizable detection algorithms, which are crucial for mitigating the risks associated with manipulated images in the age of misinformation. By highlighting the limitations of current methods, the authors encourage the exploration of novel approaches that can enhance the resilience of detection systems against evolving manipulation techniques.

Authors: Keanu Nichols, Divya Appapogu, Giscard Biamby, Dina Bashkirova, Anna Rohrbach, Bryan A. Plummer
Source: arXiv:2605.20174v1
URL: https://arxiv.org/abs/2605.20174v1

By Callan Zhang · May 19, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG