Notable multimodal

WavFlow: Audio Generation in Waveform Space

Feiyan Zhou, Luyuan Wang, Shoufa Chen, Zhe Wang, Zhiheng Liu, Yuren Cong

Published: May 18, 2026 — 17:59 UTC

Problem
This paper addresses the limitations of existing audio generation methods that rely on latent-space compression, which can introduce complexity and potential information loss. The authors propose WavFlow, a novel framework that generates high-fidelity audio directly in the raw waveform space, circumventing the need for intermediate representations. This work is presented as a preprint and has not yet undergone peer review.

Method
WavFlow employs a unique approach to audio generation by reshaping audio signals into 2D token grids through a process termed “waveform patchify.” This transformation allows the model to handle high-dimensional and low-energy signals more effectively. To facilitate stable optimization, the authors introduce an amplitude lifting technique that aligns signal scales, enabling direct x-prediction in flow matching. The model is trained on a curated dataset comprising 5 million high-quality video-text-audio triplets, which enhances its ability to learn fine-grained acoustic patterns from scratch. The architecture details, including specific layers or flow structures, are not disclosed in the abstract.

Results
WavFlow demonstrates competitive performance on two benchmarks: VGGSound and AudioCaps. On VGGSound, it achieves a Frame Discriminative score (FD_PaSST) of 59.98, an Inception Score (IS_PANNs) of 17.40, and a DeSync score of 0.44. For AudioCaps, the model reports an FD_PANNs of 10.63 and an IS_PANNs of 12.62. These results indicate that WavFlow matches or exceeds the performance of established latent-based methods, showcasing its effectiveness in generating high-quality audio without intermediate compression.

Limitations
The authors acknowledge that while WavFlow offers a simpler alternative to latent-based methods, it may still face challenges in scalability and generalization across diverse audio generation tasks. Additionally, the reliance on a large curated dataset raises questions about the model’s performance in low-resource settings or with less structured data. The paper does not discuss potential computational costs associated with training on such a large dataset or the model’s inference efficiency.

Why it matters
The implications of this work are significant for the field of audio generation and multimodal learning. By demonstrating that high-quality audio synthesis can be achieved without intermediate compression, WavFlow opens new avenues for research in direct waveform generation. This could lead to more efficient models that are easier to train and deploy, particularly in applications requiring real-time audio synthesis or those constrained by computational resources. Furthermore, the automated data pipeline for curating multimodal triplets may inspire future work on dataset generation and augmentation strategies in audio and video domains.

Authors: Feiyan Zhou, Luyuan Wang, Shoufa Chen, Zhe Wang, Zhiheng Liu, Yuren Cong, Xiaohui Zhang, Fanny Yang et al.
Source: arXiv:2605.18749
URL: https://arxiv.org/abs/2605.18749v1

By Callan Zhang · May 18, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CV