Robust Spoofed Speech Detection via Temporal Pyramid Modeling
Mahtab Masoudi Nezhad, Nima Karimian
- Published
- Jun 15, 2026 — 15:16 UTC
Problem
The paper addresses the challenge of spoofed speech detection, particularly in the context of realistic synthesis, voice conversion, and replay attacks. A significant gap in the literature is the lack of effective cross-dataset generalization, which hampers the robustness of existing models. This work is a preprint and has not undergone peer review, indicating that the findings should be interpreted with caution.
Method
The authors propose a novel architecture called the Temporal Pyramid Adapter, which employs parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues. This design allows the model to effectively analyze both local artifacts and global prosodic irregularities in speech. The model integrates self-supervised representations from XLS-R, combined with front-end adapters such as Mel and Sinc filters, alongside the Temporal Pyramid structure for enhanced temporal modeling. The training process and compute resources utilized are not explicitly disclosed, but the architecture’s complexity suggests a significant computational investment.
Results
The Temporal Pyramid model was evaluated across multiple benchmarks, including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Notably, it achieved an Area Under the Curve (AUC) of 99.24% and an Equal Error Rate (EER) of 3.87% on the PartialSpoof database. These results significantly outperform baseline models, including LCNN-BLSTM, which recorded an EER of 9.87%, and TRACE, with an EER of 8.08%. The multilingual evaluations further indicate that spoofing artifacts are independent of language, suggesting the model’s robustness across diverse linguistic contexts.
Limitations
The authors acknowledge that while self-supervised representations enhance robustness, performance declines under domain and language shifts, indicating a need for improved adaptation and calibration strategies. Additionally, the paper does not address potential overfitting to the training datasets or the generalizability of the model to unseen spoofing techniques. The reliance on specific architectures and representations may also limit applicability in resource-constrained environments.
Why it matters
This work has significant implications for the field of speech processing and security, particularly in enhancing the reliability of spoofed speech detection systems. The introduction of the Temporal Pyramid Adapter could pave the way for more robust models that generalize better across various datasets and conditions. Future research may build on these findings to develop adaptive strategies that mitigate performance degradation in diverse environments, as published in arXiv cs.CV.
By Turing Wire editorial staff · Jun 15, 2026 · Editorial standards →
Source: arXiv cs.CV