Audio Interaction Model
Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma
- Published
- Jun 3, 2026 — 17:26 UTC
Problem
Current Large Audio Language Models (LALMs) operate in an offline manner, limiting their interactivity and responsiveness. Existing streaming audio models are typically designed for single tasks, such as automatic speech recognition (ASR) or voice chatting, which constrains their utility in dynamic environments. This paper addresses the gap by proposing a unified online LALM that can process audio inputs in real time, enabling a continuous perceive-decide-respond loop. The work is presented as a preprint, indicating it has not yet undergone peer review.
Method
The authors introduce the Audio Interaction Model, realized through a framework called Audio-Interaction. This model integrates offline task execution with online general audio instruction following. The key components include:
- SoundFlow Framework: This end-to-end framework facilitates the perceive-decide-respond loop, incorporating streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference.
- Data: The authors constructed a new dataset, StreamAudio-2M, comprising 2.6 million items that cover seven fundamental abilities and 28 sub-tasks, enabling diverse training scenarios.
- Training: The model employs a comprehension-aware training approach to enhance its ability to follow audio instructions in real time.
- Inference: The asynchronous low-latency inference mechanism is designed to ensure stable interactions during real-time audio processing.
Results
The Audio Interaction Model was evaluated across eight benchmarks, demonstrating competitive performance compared to existing mainstream audio tasks. Notably, it achieved significant improvements in real-time ASR and streaming audio instruction following. The proactive help capabilities were also highlighted, showcasing the model’s ability to intervene based on audio context. Specific performance metrics were not disclosed in the abstract, but the results indicate a clear advantage over traditional offline LALMs.
Limitations
The authors acknowledge that while the Audio Interaction Model enhances interactivity, it may still face challenges in handling complex multi-turn dialogues and understanding nuanced audio contexts. Additionally, the reliance on a large streaming corpus may limit the model’s applicability in low-resource settings. The paper does not address potential issues related to model scalability or the computational demands of real-time processing.
Why it matters
The introduction of the Audio Interaction Model represents a significant advancement in the field of audio processing and interaction, enabling more responsive and context-aware applications. This work lays the groundwork for future research in real-time audio understanding and interaction, potentially influencing the development of more sophisticated conversational agents and interactive systems. The implications of this research extend to various domains, including human-computer interaction and assistive technologies, as published in arXiv.
By Turing Wire editorial staff · Jun 3, 2026 · Editorial standards →
Source: arXiv cs.AI