Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
Eder del Blanco, David Gimeno-Gómez, Eva Navas, Carlos-D. Martínez-Hinarejos, Inma Hernáez
- Published
- Jun 8, 2026 — 15:50 UTC
Problem
This work addresses the gap in silent speech synthesis (SSI) technologies, particularly the integration of surface electromyography (sEMG) and video-based lipreading for continuous speech synthesis. Existing multimodal approaches often overlook robustness to modality degradation or sensor failures, which limits their practical application. The authors highlight that this is an unreviewed preprint, indicating the need for further validation in the research community.
Method
The authors propose a masked multimodal speech synthesis framework that utilizes both sEMG and lipreading signals. The core technical contribution involves a training strategy that incorporates modality masking, allowing the model to learn to synthesize speech even when one modality is compromised. The architecture specifics are not disclosed, but the training process emphasizes the importance of masking strategies over traditional data augmentation techniques. The model is evaluated under multispeaker conditions, which enhances its generalizability. The training compute requirements are not specified, but the focus on robustness suggests a potentially high computational demand to accommodate the dual-modality input.
Results
The proposed framework achieves a reduction in word error rate (WER) by up to 14 absolute percentage points compared to the strongest unimodal baseline. This significant improvement demonstrates the effectiveness of the masked multimodal approach. The authors also report that the masking strategies contribute to enhanced performance under low-bitrate conditions, outperforming degradation-specific data augmentations. Phone-level analyses indicate that the integration of modalities provides complementary benefits, particularly for vowels and certain consonant groups, although specific quantitative results for these analyses are not detailed.
Limitations
The authors acknowledge that while their approach shows promise, adaptation to laryngectomized speakers remains an unresolved challenge. Additionally, the study does not explore the long-term robustness of the model in real-world applications, nor does it address potential biases in the training data that could affect generalization. The lack of detailed architecture and training compute specifications may also limit reproducibility.
Why it matters
This research has significant implications for the development of assistive technologies for individuals with speech impairments, particularly those who cannot produce vocal sounds. By demonstrating the effectiveness of masked multimodal integration, the findings pave the way for more robust SSI systems that can function effectively in real-world scenarios. The work encourages further exploration into the adaptation of these systems for diverse user populations, as highlighted in related literature on multimodal speech synthesis, such as that found in arXiv.
By Turing Wire editorial staff · Jun 8, 2026 · Editorial standards →
Source: arXiv cs.CL