Notable evaluation benchmarks

When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift

arXiv:2606.19184 Papers with Code →

Dat Nguyen, Cosmin Radoi, Romain Hermary, Marcella Astrid, Nesryne Mejri, Enjie Ghorbel

Published: Jun 17, 2026 — 15:24 UTC

Problem
The paper identifies a significant gap in the evaluation of deepfake detection systems, particularly in the context of domain shifts. Traditional metrics, such as the Area Under the ROC Curve (AUC), are often computed separately across multiple datasets, failing to capture the complexities of real-world scenarios where detectors encounter diverse data sources and varying artifact types. This limitation is particularly critical given the rise of sophisticated deepfake generation techniques, which necessitate robust evaluation methods. The authors note that this work is a preprint and has not undergone peer review.

Method
To address the identified shortcomings, the authors propose a new evaluation metric termed Cross-dataset AUC (Cross-AUC). This metric averages the AUC scores across different domains while incorporating a measure of prediction polarization, which quantifies the robustness of the detector to domain shifts. The polarization is assessed using the Wasserstein Distance, which measures the divergence between class score distributions. The authors conducted experiments on seven benchmark datasets, although specific details regarding the architecture of the deepfake detectors or the training compute used were not disclosed.

Results
The introduction of Cross-AUC demonstrated significant improvements in evaluating deepfake detectors compared to traditional AUC metrics. The paper reports that Cross-AUC provides a more nuanced understanding of detector performance under domain shifts, with specific numerical results indicating that detectors evaluated with Cross-AUC exhibited a more stable performance across datasets. While exact performance metrics against named baselines were not detailed in the abstract, the authors emphasize that Cross-AUC effectively captures the generalization capabilities of the detectors, particularly in scenarios with mixed data sources.

Limitations
The authors acknowledge that while Cross-AUC offers a more interpretable evaluation framework, it still relies on the quality and diversity of the datasets used for training and testing. Additionally, the paper does not address potential computational overhead associated with calculating the Wasserstein Distance, which may impact scalability in real-world applications. Furthermore, the reliance on existing datasets may limit the generalizability of the findings to novel deepfake generation techniques that were not included in the evaluation.

Why it matters
The development of Cross-AUC has significant implications for the field of deepfake detection, as it provides a more realistic assessment of detector performance in the face of domain shifts, which are common in practical applications. This work encourages future research to adopt more robust evaluation metrics that reflect real-world conditions, potentially leading to the development of more resilient detection systems. The findings and methodology presented in this paper can serve as a foundation for subsequent studies aimed at improving deepfake detection technologies, as published in arXiv cs.LG.

By Callan Zhang · Jun 17, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG