Efficient ASR Training with Conversations that Never Happened
Máté Gedeon, Péter Mihajlik
- Published
- Jun 2, 2026 — 17:46 UTC
Problem
The paper addresses the challenge of limited multi-speaker training data for conversational Automatic Speech Recognition (ASR) in lower-resource languages and niche domains. The authors highlight the scarcity of domain-matched data, which hampers the performance of ASR systems. This work is particularly relevant as it is a preprint and has not undergone peer review, indicating that the findings should be interpreted with caution.
Method
The authors propose an augmentation pipeline that generates scenario-level dialogues with participant metadata. This involves mapping speaker attributes to Text-to-Speech (TTS) voice profiles and assembling synthesized utterances into speaker-aware simulated conversations. They evaluated five families of Large Language Models (LLMs) under various configurations: single-generator, fixed-budget mixture, and scale-up settings. All models were trained using the same FastConformer-Large architecture, with a consistent training recipe. The evaluation was conducted on the Hungarian BEA-Dialogue benchmark corpus, demonstrating the method’s applicability across languages, provided the necessary resources for each component are available.
Results
The results indicate that synthetic conversations significantly enhance speech recognition performance. The largest training configuration, which utilized 67 hours of real conversational data alongside 636 hours of simulated data, outperformed a zero-shot model trained on 2700 hours of Hungarian speech. Specifically, the synthetic data improved performance metrics on the BEA-Dialogue benchmark, although the extent of improvement varied based on the choice of generator and the composition of the training data. The authors provide quantitative results, showing that the integration of LLM-generated conversational data can yield substantial gains in ASR accuracy.
Limitations
The authors acknowledge that the effectiveness of the proposed method is contingent on the quality of the LLMs used and the diversity of the generated data. They also note that while the approach shows promise, it may not fully replace the need for real conversational data, particularly in capturing nuanced human interactions. Additionally, the reliance on TTS systems introduces potential limitations in voice quality and naturalness, which could affect the training outcomes. The paper does not address the computational costs associated with generating large volumes of synthetic data, which may be a barrier for some applications.
Why it matters
This work has significant implications for advancing ASR systems in low-resource settings, where acquiring real conversational data is often impractical. By demonstrating that LLM-generated synthetic conversations can effectively complement real data, the authors provide a viable pathway for improving ASR performance in underrepresented languages and domains. This approach could facilitate broader accessibility and usability of ASR technologies, as highlighted in related literature on data augmentation strategies for machine learning models, as published in arXiv cs.AI.
By Turing Wire editorial staff · Jun 2, 2026 · Editorial standards →
Source: arXiv cs.AI