Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets
Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K.
- Published
- May 14, 2026 — 17:04 UTC
Problem
This paper addresses the gap in the generation of metadata for machine learning datasets, particularly in the context of local repositories that require governance and cannot rely on public platforms for data upload. While the Croissant metadata standard has gained traction, especially with its adoption by NeurIPS for dataset submissions, the existing methods for generating Croissant metadata are not suitable for large, governed datasets. This work presents Croissant Baker, a tool designed to facilitate local metadata generation, which is crucial for ensuring discoverability, governance, and reusability of datasets in environments where data cannot be publicly shared. The paper is a preprint and has not yet undergone peer review.
Method
Croissant Baker is an open-source command-line tool that generates Croissant-compliant metadata directly from a dataset directory. The architecture is modular, utilizing a handler registry that allows for extensibility and customization based on the dataset’s characteristics. The tool processes datasets locally, ensuring that sensitive or governed data remains within the local environment. The authors evaluated Croissant Baker on over 140 datasets, including the large-scale MIMIC-IV dataset, which contains 886 million rows and 374 Parquet files. The validation process involves comparing the generated metadata against ground truth derived from producer-authored metadata or established standards, achieving high accuracy in metadata generation.
Results
Croissant Baker demonstrated an impressive agreement rate of 97-100% when compared to ground truth metadata across various domains. This performance was consistent even with the large MIMIC-IV dataset, indicating the tool’s scalability and robustness. The results suggest that Croissant Baker can effectively generate high-quality metadata that meets the standards set by the Croissant framework, thereby facilitating better dataset management and discoverability.
Limitations
The authors acknowledge that while Croissant Baker achieves high accuracy, the evaluation is limited to the datasets tested, and further validation on a broader range of datasets is necessary to generalize the findings. Additionally, the tool’s performance may vary with datasets that have unique or complex structures not represented in the evaluation set. The paper does not address potential challenges in integrating Croissant Baker into existing data pipelines or the learning curve associated with its command-line interface.
Why it matters
The development of Croissant Baker has significant implications for the machine learning community, particularly in enhancing the governance and discoverability of datasets that are not suitable for public sharing. By enabling local metadata generation, the tool supports compliance with data governance policies while promoting the reuse of high-value datasets. This advancement is crucial as the demand for high-quality, well-documented datasets continues to grow in the field of machine learning, ultimately contributing to more reproducible and transparent research practices.
Authors: Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K., Nobin Sarwar, Rajat Ghosh et al.
Source: arXiv:2605.15079
URL: https://arxiv.org/abs/2605.15079v1
By Turing Wire editorial staff · May 14, 2026 · Editorial standards →
Source: arXiv cs.LG