Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan
Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee
- Published
- Jun 8, 2026 — 17:29 UTC
Problem
This work addresses the challenge of neural machine translation (NMT) for low-resource Indigenous languages, specifically Q’eqchi’ Mayan, which suffers from extreme data scarcity. Traditional methods often rely on extractive web-scraping, which raises concerns about data sovereignty. The authors propose a novel approach to bootstrap NMT models without the need for scraping target-language parallel text, filling a significant gap in the literature regarding low-resource language processing. This paper is a preprint and has not yet undergone peer review.
Method
The authors introduce a data synthesis methodology that transforms community-sourced dictionaries into a large synthetic corpus for training NMT models. They employ Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA) on an mT5-base model. The training process leverages the synthetic corpus to teach the model complex agglutinative morphology and verb-object-subject (VOS) word order. The evaluation metrics include BLEU scores, with a focus on in-domain performance and structural acquisition.
Results
The model achieved a BLEU score of 42.02 in in-domain evaluations, indicating effective structural acquisition of the target language’s morphology and syntax. However, when evaluated against an organic glossary, the model’s performance dropped significantly to a BLEU score of 0.59, highlighting a structural-semantic gap. This discrepancy suggests that while the model can maintain grammatical integrity, it lacks the lexical grounding necessary for natural language use. The authors also conducted an ablation study using a Multi-Task Learning architecture, which resulted in negative transfer, indicating that auxiliary tasks competed for the limited parameter capacity of the LoRA adapters, leading to over-optimization for synthetic markers.
Limitations
The authors acknowledge that the model exhibits overfitting to the synthetic templates, which constrains its ability to generalize to the syntactic fluidity of natural language. They note that while synthetic bootstrapping serves as a structural primer, it necessitates authentic data for semantic refinement through Curriculum Learning. Additionally, the reliance on synthetic data may not fully capture the complexities of natural language, which could limit the model’s applicability in real-world scenarios.
Why it matters
This research has significant implications for the development of NMT systems for low-resource languages, demonstrating that synthetic data can effectively bootstrap models but also highlighting the necessity of authentic data for semantic grounding. The findings suggest a pathway for future work in low-resource NMT, emphasizing the importance of integrating real-world linguistic data to enhance model performance. This study contributes to the ongoing discourse on data sovereignty and the ethical considerations of language technology, as discussed in related works on low-resource language processing, as published in arXiv cs.AI.
By Turing Wire editorial staff · Jun 8, 2026 · Editorial standards →
Source: arXiv cs.AI