WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng
- Published
- May 13, 2026 — 17:59 UTC
Problem
This paper addresses the challenge of transcribing and translating endangered indigenous languages, specifically Wardaman, with extremely limited annotated data—only 6 hours of audio. The authors highlight the inadequacy of existing models that typically rely on large-scale datasets for tasks like transcription and translation, which are not feasible in this low-resource context. This work is presented as a preprint and has not yet undergone peer review.
Method
The core technical contribution of WARDEN is its two-stage architecture, which separates the tasks of transcription and translation. The transcription model converts Wardaman audio inputs into phonemic transcriptions, while the translation model converts these transcriptions into English. To enhance performance, the authors employ two key techniques:
- Transcription Initialization: The Wardaman transcription model is initialized using a Sundanese language model, leveraging phonemic similarities to accelerate fine-tuning.
- Domain-Specific Knowledge: A Wardaman-English dictionary, compiled from expert annotations, is integrated into a large language model (LLM) to inform the translation process, allowing it to utilize domain-specific knowledge effectively.
The training process is not disclosed in detail, but the authors emphasize the efficiency of their approach in low-data scenarios compared to traditional unified models.
Results
WARDEN demonstrates superior performance on transcription and translation tasks compared to larger open-source and proprietary models, despite the limited training data. The authors report that their system establishes a strong baseline for Wardaman to English translation, outperforming existing models that typically require extensive datasets. Specific quantitative results are not provided in the abstract, but the authors claim significant improvements over named baselines, indicating a robust effect size in the context of low-resource language processing.
Limitations
The authors acknowledge the inherent limitations of their approach, primarily stemming from the small size of the training dataset, which may restrict the generalizability of the model. Additionally, the reliance on phonemic similarities with Sundanese may not be applicable to all indigenous languages, potentially limiting the broader applicability of the method. The paper does not discuss potential biases introduced by the expert annotations used for the dictionary or the implications of using a LLM for translation.
Why it matters
This work has significant implications for the preservation and revitalization of endangered languages, providing a framework for developing transcription and translation systems in low-resource settings. By demonstrating that effective language processing can be achieved with minimal data, WARDEN paves the way for future research in low-resource language modeling and highlights the importance of leveraging domain-specific knowledge. This approach could inspire similar methodologies for other endangered languages, contributing to the broader field of natural language processing and linguistic preservation.
Authors: Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng
Source: arXiv:2605.13846
URL: https://arxiv.org/abs/2605.13846v1
By Turing Wire editorial staff · May 13, 2026 · Editorial standards →
Source: arXiv cs.AI