
TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection

Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller

Original source

arXiv cs.CV

https://arxiv.org/abs/2604.26772v1

Problem
This preprint targets a gap in the literature on using vision foundation models (VFMs) to detect AI-generated images (AIGIs). Prior work has leveraged CLIP vision transformers for this task, but newer VFMs, which incorporate architectural advances and more diverse training paradigms, remain largely unexplored. The authors benchmark these models on detecting both fully generated and AI-inpainted images, broadening the understanding of their utility in AI image forensics.
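The CLIP-based approach from prior work typically freezes the vision encoder and trains only a lightweight probe on its image embeddings. As a rough, hypothetical sketch (the embeddings below are random stand-ins, not real CLIP features, and the dimensions are invented), a logistic-regression probe for real-vs-generated classification might look like:

```python
import numpy as np

# Stand-in for frozen CLIP image embeddings of real vs. AI-generated images.
rng = np.random.default_rng(0)
d = 32
real = rng.normal(loc=-0.5, size=(100, d))  # features of "real" images
fake = rng.normal(loc=0.5, size=(100, d))   # features of "generated" images
X = np.vstack([real, fake])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Logistic-regression probe trained with plain gradient descent;
# the encoder itself stays frozen, only w and b are learned.
w, b = np.zeros(d), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted P(generated)
    g = p - y                           # gradient of the logistic loss
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

acc = ((X @ w + b > 0) == y).mean()
print(acc)
```

This is only meant to illustrate the linear-probe paradigm the paper builds on; the actual baselines and training setup are not detailed in the abstract.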

Method
The authors systematically evaluate multiple VFMs on AIGI detection, varying pretraining objective, input resolution, and model scale, and measuring out-of-the-box performance. They also introduce Tunable Attention Pooling (TAP), a redesigned classifier head that aggregates the model's output tokens into a refined global representation, letting the classifier exploit the rich feature sets of modern VFMs. Training compute details are not disclosed; the focus is on the architectural modification and its integration with the latest VFMs.
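The abstract does not spell out TAP's exact formulation. As a hedged sketch of the general idea of attention pooling over a ViT's patch tokens (all names, weights, and dimensions here are invented, not the paper's), a learnable query can attend over the token sequence to produce a single global vector:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, query, w_k, w_v):
    """Pool N patch tokens of shape (N, d) into one global (d,) vector.

    A learnable query scores each (projected) token; the pooled output
    is the attention-weighted sum of the (projected) token values.
    """
    keys = tokens @ w_k                                    # (N, d)
    values = tokens @ w_v                                  # (N, d)
    scores = softmax(keys @ query / np.sqrt(query.size))   # (N,) weights
    return scores @ values                                 # (d,) pooled vector

rng = np.random.default_rng(0)
d, n = 8, 16
tokens = rng.normal(size=(n, d))       # stand-in for ViT patch tokens
pooled = attention_pool(tokens, rng.normal(size=d), np.eye(d), np.eye(d))
print(pooled.shape)  # (8,)
```

In contrast to taking only a CLS token or a uniform mean over patches, the pooling weights here are learned, which is the property the title's "tap into the patch tokens" alludes to; whether TAP matches this single-query form is an assumption.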

Results
TAP-enhanced models deliver substantial gains in AIGI detection accuracy: the best model surpasses the original CLIP baseline by more than 12% on established benchmarks and sets a new state of the art on two challenging datasets for in-the-wild detection of AI-generated and AI-inpainted images. The abstract does not name the specific benchmarks or baselines, but the reported margins indicate a clear advance over existing methods.

Limitations
The work focuses on detection performance and does not explore the interpretability of the TAP mechanism or the models' robustness to adversarial attacks. It also does not compare the computational cost of TAP against traditional pooling techniques, and generalization to generative models outside the evaluated benchmarks remains untested.

Why it matters
This work has significant implications for the field of AI image forensics, particularly as the prevalence of AI-generated content increases. By demonstrating the effectiveness of modern VFMs and the TAP mechanism, the authors provide a pathway for future research to explore more sophisticated detection methods. The findings encourage further investigation into the capabilities of VFMs beyond their original design intentions, potentially leading to enhanced tools for combating misinformation and ensuring the integrity of visual media.


Published
Apr 29, 2026 — 15:03 UTC