Notable foundation models

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, Mrudula Athi, Anton Ratnarajah, Amit Chhetri

Published
Jun 4, 2026 — 17:42 UTC

Problem
This paper addresses the limitations of existing multi-domain audio encoders, particularly the USAD and SPEAR models, which have restricted coverage and evaluation metrics. The authors highlight the need for a universal audio encoder that can effectively integrate knowledge from both self-supervised learning (SSL) and supervised models, especially as large language models (LLMs) increasingly depend on a single encoder for diverse audio inputs. The work is presented as a preprint, indicating it has not yet undergone peer review.

Method
USAD 2.0 introduces a novel architecture that employs domain-aware distillation to mitigate teacher-student mismatch during training. This approach allows the model to leverage both SSL and supervised foundation models effectively. The architecture is scaled to one billion parameters through depth scaling, enhancing its capacity to learn complex audio representations. The model incorporates a second-stage supervised distillation process, which fine-tunes the encoder for specific downstream tasks. The training dataset includes a diverse array of audio samples, covering various domains such as speech and music, to ensure comprehensive representation learning.

Results
USAD 2.0 demonstrates superior performance across multiple benchmarks, achieving state-of-the-art results in both probing tasks and evaluations involving LLMs. Specifically, it outperforms previous models by significant margins, although exact numerical improvements over named baselines are not disclosed in the abstract. The authors report that USAD 2.0 excels in tasks requiring general audio understanding, indicating its robustness across different audio domains.

Limitations
The authors acknowledge that while USAD 2.0 improves upon previous models, it may still face challenges in certain niche audio domains that require specialized encoders. Additionally, the reliance on large-scale datasets for training may limit accessibility for researchers with fewer resources. The paper does not address potential issues related to the computational cost of training such a large model or the implications of model size on deployment in real-world applications.

Why it matters
The development of USAD 2.0 has significant implications for the field of audio processing, particularly in enhancing the capabilities of LLMs that require robust audio understanding. By integrating SSL and supervised learning, this work paves the way for more versatile audio encoders that can be applied across various applications, from speech recognition to music analysis. The findings contribute to the ongoing discourse on the effectiveness of multi-domain models in audio processing, as discussed in related literature. This work is available on arXiv.

Turing Wire

By Turing Wire editorial staff · Jun 4, 2026 · Editorial standards →

Source: arXiv cs.CL