LLMSurgeon: Diagnosing Data Mixture of Large Language Models
Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi
- Published
- May 28, 2026 — 17:59 UTC
Problem
This preprint addresses the lack of transparency in the pretraining data mixtures of Large Language Models (LLMs), which significantly influence their behaviors and failure modes. The authors highlight the challenge of post-hoc auditing of these data mixtures, as the composition is rarely disclosed. The work formalizes the concept of Data Mixture Surgery (DMS), aiming to estimate the domain-level distribution of a model’s pretraining corpus based solely on generated text, thus filling a critical gap in the literature regarding model interpretability and accountability.
Method
The core technical contribution is the LLMSurgeon framework, which approaches DMS as an inverse problem under the label-shift assumption. Instead of aggregating classifier outputs directly, LLMSurgeon estimates a calibrated soft confusion matrix. This matrix is used to correct systematic domain confusion and recover the latent mixture prior of the pretraining data. The authors also introduce LLMScan, an evaluation suite designed to verify the effectiveness of LLMSurgeon using open-source LLMs with known pretraining mixtures. The methodology emphasizes a constrained optimization approach to ensure fidelity in recovering domain mixtures.
Results
LLMSurgeon demonstrates high fidelity in recovering domain mixtures across various protocols in the LLMScan evaluation suite. While specific numerical results are not disclosed in the abstract, the authors claim that their method outperforms existing baselines in terms of accuracy and reliability in estimating the domain distributions. The evaluation is based on a set of open-source LLMs, which allows for a transparent comparison against known pretraining data mixtures, thus providing a robust benchmark for assessing the effectiveness of the proposed method.
Limitations
The authors acknowledge that LLMSurgeon relies on the label-shift assumption, which may not hold in all scenarios, potentially limiting its applicability. Additionally, the method’s performance may vary depending on the complexity of the domain taxonomy used for classification. The paper does not address the computational overhead associated with estimating the soft confusion matrix or the potential biases introduced by the choice of the predefined taxonomy. Furthermore, the reliance on generated text for inference may introduce noise, affecting the accuracy of the domain mixture recovery.
Why it matters
This work has significant implications for the field of model interpretability and accountability, particularly in the context of foundation models. By providing a practical, post-hoc auditing tool for analyzing the “digital DNA” of LLMs, LLMSurgeon enables researchers and practitioners to better understand the influences of pretraining data on model behavior. This transparency is crucial for responsible AI deployment, as it allows for the identification of biases and failure modes inherent in LLMs, ultimately guiding future research in model design and training practices.
Authors: Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen
Source: arXiv:2605.30348
URL: https://arxiv.org/abs/2605.30348v1
By Turing Wire editorial staff · May 28, 2026 · Editorial standards →
Source: arXiv cs.AI