
Counterfactual Stress Testing for Image Classification Models

Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta, Mélanie Roschewitz, Ben Glocker

Published: May 11, 2026, 17:36 UTC

Problem
This preprint addresses a gap in evaluating the robustness of deep learning models in medical imaging, particularly under distribution shifts caused by variations in demographics, scanner hardware, or acquisition protocols. Current stress-testing methodologies often rely on simplistic perturbations that do not reflect clinically relevant variation, leading to an overestimation of model robustness. The authors also highlight underspecification: models with similar validation performance can exhibit markedly different failure modes in real-world deployment.

Method
The authors propose a counterfactual stress-testing framework that leverages causal generative models to create realistic "what if" images. The framework supports targeted interventions on specific attributes, such as scanner type or patient sex, while preserving anatomical identity. The abstract does not detail the generative architecture, but it evidently incorporates causal inference principles to produce semantically meaningful variations. The evaluation spans two imaging modalities, chest X-ray and mammography, using three different model architectures. Training compute requirements are not disclosed; the methodology instead emphasizes generating controlled counterfactual images to assess model performance under distribution shift.
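The paper itself does not publish code for this step, but the core evaluation loop described above can be sketched in a few lines. The sketch below is a toy illustration, not the authors' implementation: `generate_cf` stands in for the causal generative model, and the classifier and "scanner shift" are made-up stand-ins used only to make the example runnable.

```python
import numpy as np

def counterfactual_stress_test(model, generate_cf, images, labels, intervention):
    """Measure the accuracy drop when a classifier is evaluated on
    counterfactual versions of `images`.

    `generate_cf(image, intervention)` is a hypothetical causal generative
    model: it returns a "what if" image with one attribute changed
    (e.g. intervention={"scanner": "B"}) while preserving anatomy.
    """
    def accuracy(xs):
        preds = np.array([model(x) for x in xs])
        return float((preds == labels).mean())

    acc_orig = accuracy(images)
    cf_images = [generate_cf(x, intervention) for x in images]
    acc_cf = accuracy(cf_images)
    return acc_orig - acc_cf  # positive value = performance degraded

# --- Toy stand-ins so the sketch runs end to end ---
rng = np.random.default_rng(0)
images = [rng.normal(size=(8, 8)) for _ in range(100)]
labels = np.array([int(x.mean() > 0) for x in images])

# Stand-in classifier: thresholds mean intensity.
model = lambda x: int(x.mean() > 0)

# Stand-in "causal" generator: switching scanners shifts intensity slightly.
def generate_cf(x, intervention):
    shift = 0.05 if intervention.get("scanner") == "B" else 0.0
    return x + shift

drop = counterfactual_stress_test(model, generate_cf, images, labels, {"scanner": "B"})
```

The key property, as described in the summary, is that the intervention is targeted: only the named attribute changes, so any accuracy drop is attributable to that specific shift rather than to arbitrary noise.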

Results
The counterfactual stress testing framework demonstrates a significantly improved correlation with real out-of-distribution performance compared to traditional perturbation methods. The authors report that their approach captures both the direction and relative magnitude of performance changes across different models and scenarios. Specific performance metrics are not provided in the abstract, but the results indicate that the counterfactual method outperforms classical perturbations in model ranking and robustness assessment. This suggests a more reliable evaluation of medical AI systems prior to deployment.
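The claimed agreement in "direction and relative magnitude" of performance changes amounts to checking correlation and rank consistency between counterfactually predicted and real out-of-distribution drops across models. A minimal sketch with invented numbers (the paper's actual figures are not in the abstract):

```python
import numpy as np

# Hypothetical per-model accuracy drops under a scanner shift:
# predicted via counterfactual stress testing vs. measured on a real
# out-of-distribution test set. Values are illustrative only.
cf_predicted = np.array([0.02, 0.08, 0.15])   # models A, B, C
real_ood     = np.array([0.03, 0.07, 0.14])

# Magnitude agreement: Pearson correlation between predicted and real drops.
pearson = np.corrcoef(cf_predicted, real_ood)[0, 1]

# Ranking agreement: do both orderings sort the models identically?
same_ranking = np.array_equal(np.argsort(cf_predicted), np.argsort(real_ood))
```

Under this kind of check, a stress test that "outperforms classical perturbations in model ranking" is one whose predicted ordering of models matches the ordering observed on real out-of-distribution data.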

Limitations
The authors acknowledge that their framework may not encompass all possible distribution shifts encountered in clinical practice, potentially limiting its generalizability. Additionally, the reliance on causal generative models necessitates accurate causal assumptions, which may not always hold in complex medical environments. The paper does not address the computational overhead associated with generating counterfactual images or the potential biases introduced by the generative model itself.

Why it matters
This work has significant implications for the deployment of AI systems in medical imaging, as it provides a more nuanced approach to robustness assessment. By utilizing causal generative models for stress testing, the framework offers a method to better predict real-world performance and failure modes, thereby enhancing the reliability of AI applications in clinical settings. This could lead to improved patient outcomes and more effective integration of AI technologies in healthcare.

Source: arXiv:2605.10894
URL: https://arxiv.org/abs/2605.10894v1

Turing Wire
Author: Turing Wire editorial staff