External Validation of Deep Learning Models for BI-RADS Breast Density Prediction from Ultrasound Images
Yuxuan Chen, Arianna Bunnell, Yanqi Xu, Haoyan Yang, Thomas K. Wolfgruber, John A. Shepherd, Yiqiu Shen
- Published: May 6, 2026 — 16:19 UTC
- Summary length: 434 words
- Relevance score: 75%
Problem
This preprint addresses the lack of external validation for deep learning models that predict breast density from ultrasound images, using the Breast Imaging Reporting and Data System (BI-RADS) categories. Prior studies evaluated these models primarily on internal datasets, raising concerns about their generalizability to diverse populations and clinical settings. The authors assess three established architectures on an independent cohort to test how robust such models are outside their training environment.
Method
The study externally validated three deep learning architectures: DenseNet121, Vision Transformer (ViT-B/32), and ResNet50. The validation cohort comprised 2,000 ultrasound exams: 500 cancer cases and 1,500 matched negative controls. Performance was quantified as patient-level area under the receiver operating characteristic curve (AUROC) for each of the four BI-RADS density categories: A (fatty), B (scattered), C (heterogeneously dense), and D (extremely dense). The models were trained on ultrasound images, though the preprint does not disclose training compute details. For downstream risk assessment, the authors combined AI-derived density with age in the Tyrer-Cuzick model and compared it against a reference model using age and mammography-reported density.
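The one-vs-rest evaluation across the four BI-RADS categories, with a micro-average pooled over all (patient, category) pairs, can be sketched as follows. This is an illustrative pure-Python sketch, not the authors' code: the rank-sum `auroc` helper and the per-patient probability-row layout are assumptions.

```python
def auroc(labels, scores):
    """Binary AUROC via the rank-sum (Mann-Whitney) formula.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def birads_aurocs(true_cats, prob_rows, cats=("A", "B", "C", "D")):
    """One-vs-rest AUROC per BI-RADS category plus a micro-average.

    true_cats: one BI-RADS letter per patient, e.g. ["B", "D", ...]
    prob_rows: one predicted-probability row per patient, ordered like `cats`.
    The micro-average pools every (patient, category) pair into one AUROC.
    """
    per_cat, flat_y, flat_p = {}, [], []
    for i, c in enumerate(cats):
        y = [1 if t == c else 0 for t in true_cats]
        p = [row[i] for row in prob_rows]
        per_cat[c] = auroc(y, p)
        flat_y += y
        flat_p += p
    return per_cat, auroc(flat_y, flat_p)
```

For example, four patients with confident, correct predictions give a per-category and micro-averaged AUROC of 1.0; the paper's reported micro-averaged AUROCs (e.g. 0.885 for DenseNet121) would come out of the same kind of pooling.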
Results
The models demonstrated varying performance across density categories. The AUROC scores were as follows: extremely dense (0.868-0.899), fatty (0.814-0.838), scattered (0.764-0.799), and heterogeneously dense (0.699-0.729). DenseNet121 achieved the highest overall performance with a micro-averaged AUROC of 0.885. Notably, the performance metrics were consistent between internal and external testing, indicating good generalizability. For the risk modeling component, the combination of age and AI-derived density yielded an AUROC of 0.541, compared to 0.570 for age and mammography-reported density, with no statistically significant difference (p = 0.23).
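The preprint reports p = 0.23 for the risk-model comparison without the summary naming the test. A common nonparametric way to compare two models' AUROCs on the same patients is a paired bootstrap; the sketch below is one such illustrative approach (the `auroc` helper and the percentile-based two-sided p-value are assumptions, not the authors' method).

```python
import random

def auroc(labels, scores):
    """Binary AUROC via the rank-sum formula; positive-negative ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_p(labels, scores_a, scores_b, n_boot=2000, seed=0):
    """Two-sided percentile-bootstrap p-value for AUROC(a) - AUROC(b).

    Resamples patients with replacement, keeping each patient's label and
    both model scores together (a paired comparison). Assumes `labels`
    contains both classes, so valid resamples occur with high probability.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # resample must contain both classes
            diffs.append(auroc(y, [scores_a[i] for i in idx])
                         - auroc(y, [scores_b[i] for i in idx]))
    frac_le = sum(d <= 0 for d in diffs) / n_boot
    frac_ge = sum(d >= 0 for d in diffs) / n_boot
    return min(1.0, 2.0 * min(frac_le, frac_ge))
```

With small observed differences such as 0.541 vs. 0.570, many bootstrap resamples flip the sign of the difference, which is what a non-significant p-value like 0.23 reflects.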
Limitations
The authors acknowledge that although the models generalize well overall, performance in heterogeneously dense breasts remains suboptimal and needs further optimization. Reliance on a single external cohort also limits how broadly the findings apply across populations. The undisclosed training compute specifications and the absence of a model-interpretability analysis are further notable limitations.
Why it matters
This work has significant implications for the integration of AI in clinical workflows for breast cancer risk assessment. The demonstrated generalizability of deep learning models across diverse populations enhances confidence in their deployment in real-world settings. However, the challenges in accurately predicting breast density in heterogeneously dense cases highlight the necessity for targeted model improvements. Future research could focus on refining these models and exploring their integration into broader diagnostic frameworks, potentially improving early detection and patient outcomes.
Authors: Yuxuan Chen, Arianna Bunnell, Yanqi Xu, Haoyan Yang, Thomas K. Wolfgruber, John A. Shepherd, Yiqiu Shen
Source: arXiv:2605.05082
URL: https://arxiv.org/abs/2605.05082v1