Notable efficiency inference OpenAI

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik

Published: Jun 8, 2026 — 14:18 UTC

Problem
This work addresses the significant performance gap in automatic speech recognition (ASR) for Dravidian languages when using multilingual models like Whisper, which are optimized for high-resource languages. The authors highlight that Dravidian languages exhibit longer word lengths, greater vocabulary diversity, and lower token repetition, leading to sparse token distributions and increased character-level substitution errors. The paper is a preprint and has not undergone peer review, indicating that the findings are preliminary.

Method
The authors propose two key enhancements to the Whisper decoder: Weighted-Attention and Self-Conditioning. Weighted-Attention adaptively balances the contributions of self-attention (which captures linguistic context) and cross-attention (which incorporates acoustic cues) to mitigate decoder imbalances observed during fine-tuning. Self-Conditioning involves reinjecting intermediate predictions back into the decoding process to enhance token consistency. The authors conducted experiments using a dataset of Dravidian languages, although specific details regarding the dataset size, training compute, and hyperparameters are not disclosed.

Results
The proposed methods resulted in consistent reductions in Word Error Rates (WER) for Dravidian and other low-resource languages. While specific numerical results are not provided in the summary, the authors report that their enhancements significantly outperform baseline models fine-tuned without these techniques. The improvements are particularly notable in the context of agglutinative languages, which typically present additional challenges for ASR systems.

Limitations
The authors acknowledge that their synthetic token-repetition experiments, which indicated potential gains, are impractical for real-world applications. Additionally, the study does not explore the scalability of the proposed methods across a broader range of languages or dialects, nor does it address the computational overhead introduced by the new attention mechanisms. The lack of extensive evaluation on diverse datasets may limit the generalizability of the findings.

Why it matters
This research has significant implications for the development of ASR systems in low-resource and agglutinative language contexts, potentially enabling better accessibility and usability of technology for speakers of these languages. The enhancements proposed could serve as a foundation for future work aimed at improving multilingual ASR performance, particularly in underrepresented linguistic groups. The findings contribute to the ongoing discourse on addressing biases in AI models, as highlighted in related literature (as published in arXiv cs.CL).

By Callan Zhang · Jun 8, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL