Notable agents robotics

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

Fernando López, Santosh Kesiraju, Jordi Luque

Published: Jun 4, 2026 — 14:20 UTC

Problem
Automatic Speech Recognition (ASR) systems have achieved significant performance improvements for standard speech; however, they struggle with pathological speech resulting from neurological conditions. This paper addresses the gap in ASR capabilities for recognizing pathological speech, which remains underexplored in the literature. The authors propose a method to condition ASR models on individual speakers’ characteristics without modifying the underlying model weights. This work is presented as a preprint and has not yet undergone peer review.

Method
The authors introduce a speaker conditioning technique utilizing Feature-wise Linear Modulation (FiLM). This method involves injecting speaker-specific information, derived from x-vectors, into each transformer layer of a frozen ASR encoder. The approach allows for the adaptation of internal representations to individual pathological speakers while maintaining the integrity of the base model. The authors benchmark their method against standard fine-tuning and parameter-efficient adaptation strategies, employing both Spanish and English pathological speech datasets. The training process leverages a frozen ASR model, which is conditioned on speaker-specific features without requiring extensive retraining.

Results
The proposed FiLM-based speaker conditioning method demonstrates competitive performance against established adaptation strategies. Specifically, the authors report that their approach achieves a word error rate (WER) reduction of 15% compared to standard ASR models on pathological speech datasets. Additionally, the model retains its performance on non-conditioned speech, indicating that the conditioning does not adversely affect general ASR capabilities. The results suggest that the FiLM-based method is effective in enhancing the recognition of pathological speech while preserving the model’s ability to handle standard speech tasks.

Limitations
The authors acknowledge several limitations in their work. First, the reliance on a frozen ASR encoder may restrict the model’s ability to fully adapt to the complexities of pathological speech. Second, the evaluation is limited to Spanish and English datasets, which may not generalize to other languages or dialects. Furthermore, the paper does not explore the potential impact of varying the number of x-vectors or the specific characteristics of the pathological speech types included in the training data. These factors could influence the robustness and applicability of the proposed method in broader contexts.

Why it matters
This research has significant implications for the development of ASR systems tailored to pathological speech, which is crucial for improving communication aids for affected individuals. The ability to condition ASR models on speaker-specific features without altering the base model opens avenues for more personalized and effective speech recognition solutions. This work contributes to the ongoing efforts to enhance ASR technology for diverse speech patterns, as highlighted in related literature on speaker adaptation techniques. For further details, see the full paper available on arXiv.

By Callan Zhang · Jun 4, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL