Notable other

Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition

Ganlin Feng, Yuxi Long, Erin Lou, Lianghong Chen, Zihao Jing, Pingzhao Hu

Published
May 21, 2026 — 17:28 UTC

Problem
This preprint addresses the significant challenge of data scarcity in the domain of pediatric rare disease recognition, particularly in the context of developing computer vision systems for early diagnosis. The authors highlight the limitations imposed by privacy constraints and the reluctance to share data in pediatric settings, which severely restrict the availability of training data. While previous research has explored the augmentation of real datasets with synthetic data, it remains uncertain whether synthetic data alone can effectively support model training in ultra-low-resource environments. This study aims to fill this gap by investigating the efficacy of synthetic-only training for recognizing rare genetic diseases in children.

Method
The authors propose a controlled experimental setup where models are trained exclusively on synthetic facial images that are phenotype-aware. They utilize various backbone architectures, although specific architectures are not disclosed in the abstract. The training process involves scaling the synthetic dataset to evaluate performance across different model sizes. The core technical contribution lies in demonstrating that high-fidelity synthetic data can approximate clinically relevant distributions, thereby enabling models to learn effectively without any real data. The study emphasizes the importance of phenotype-aware generation of synthetic images to maintain semantic integrity.

Results
The findings indicate that models trained solely on synthetic data achieve performance metrics comparable to those trained exclusively on real data, provided that the synthetic dataset is sufficiently large. The authors report that across multiple backbone architectures, the synthetic-only models reach performance levels that are statistically indistinguishable from real-data-only baselines. While specific numerical results and benchmarks are not detailed in the abstract, the implication is that the effect sizes are significant enough to warrant further investigation into synthetic data applications in clinical settings.

Limitations
The authors acknowledge that their study is limited by the controlled nature of the experimental setup, which may not fully capture the complexities of real-world data distributions. Additionally, they do not address potential biases inherent in synthetic data generation, which could affect model generalization. The reliance on synthetic data raises questions about the representativeness of the generated images compared to actual patient data. Furthermore, the scalability of this approach in diverse clinical environments remains unexamined.

Why it matters
This research has substantial implications for the field of pediatric healthcare, particularly in enhancing the efficiency of data utilization for training diagnostic models. By demonstrating that synthetic data can serve as a viable alternative to real data, the study opens avenues for privacy-preserving resources in genetic education and counseling. This could facilitate improved clinician training and patient communication, ultimately leading to better diagnostic outcomes in pediatric rare diseases. The findings encourage further exploration of synthetic data methodologies, potentially transforming data scarcity challenges into opportunities for innovation in medical AI applications.

Authors: Ganlin Feng, Yuxi Long, Erin Lou, Lianghong Chen, Zihao Jing, Pingzhao Hu, Wei Xu
Source: arXiv:2605.22767
URL: https://arxiv.org/abs/2605.22767v1

Turing Wire

By Turing Wire editorial staff · May 21, 2026 · Editorial standards →

Source: arXiv cs.CV