KletterMix: Climbing Toward High-Quality German Pretraining Data
Maurice Kraus, Ruben Härle, Sebastian Sztwiertnia, Abbas Goher Khan, Mehdi Ali, Michael Fromm
- Published
- Jun 2, 2026 — 15:28 UTC
Problem
The paper addresses the significant gap in high-quality pretraining data for German language models, which are often smaller, poorly curated, and inadequately validated compared to their English counterparts. The authors highlight that existing German datasets lack the scale and diversity necessary for effective model training, which hampers the performance of German NLP applications. This work is particularly relevant as it is a preprint and has not undergone peer review, indicating that the findings should be interpreted with caution.
Method
KletterMix is constructed by translating a state-of-the-art English pretraining corpus into German, ensuring that document boundaries, metadata, source structure, and topical diversity are preserved. The authors employ a systematic approach to translation, utilizing advanced techniques to maintain the semantic and stylistic richness of the original documents. The dataset is characterized through extensive corpus-level analyses, including translation quality assessments using COMETKiwi, document length distributions, topic coverage, source composition, and geographic metadata. The authors also conduct controlled pretraining experiments and annealing ablations to evaluate the effectiveness of KletterMix against established German corpora.
Results
Models pretrained on KletterMix demonstrate significant improvements in downstream tasks compared to those trained on existing German datasets. Specifically, the authors report measurable enhancements in performance metrics, although exact numbers and specific baselines are not disclosed in the summary. The results indicate that KletterMix not only matches the scale of modern pretraining datasets but also retains high quality across diverse domains, as evidenced by the strong performance in evaluations.
Limitations
The authors acknowledge that while KletterMix represents a substantial advancement in German pretraining data, it is still reliant on the quality of the original English corpus. Potential biases inherent in the source material may carry over to the translated dataset. Additionally, the paper does not address the computational resources required for training models on KletterMix, which could be a barrier for some researchers. The lack of peer review also raises questions about the robustness of the findings.
Why it matters
KletterMix has the potential to significantly enhance the German NLP landscape by providing a high-quality, large-scale dataset for model pretraining. This work underscores the importance of curated translation efforts in expanding the capabilities of language models in underrepresented languages. The implications of this research extend to various downstream applications in German NLP, paving the way for improved model performance and broader accessibility of advanced language technologies. As such, this contribution is crucial for researchers and practitioners in the field, as published in arXiv.
By Turing Wire editorial staff · Jun 2, 2026 · Editorial standards →
Source: arXiv cs.CL