OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu
- Published: May 12, 2026, 17:56 UTC
- Summary length: 436 words
- Relevance score: 80%
Problem
This paper addresses the limitations of existing methods in joint audio-video generation, particularly in achieving high fidelity per modality, effective cross-modal alignment, and precise synchronization. The authors highlight that while Reinforcement Learning (RL) has potential in this domain, its application has not been thoroughly explored, especially concerning multi-objective and multi-modal generation. They identify three primary obstacles: (i) inconsistency in multi-objective advantages, (ii) imbalance in multi-modal gradients, and (iii) ineffective credit assignment for fine-grained cross-modal alignment. This work is presented as a preprint and has not yet undergone peer review.
Method
The authors introduce OmniNFT, a modality-aware online diffusion RL framework designed to overcome the identified challenges. The core technical contributions include:
- Modality-wise advantage routing: Each modality's advantage is computed independently and routed only to its own generation branch, so audio and video are each optimized against their own reward signal (first sketch below).
- Layer-wise gradient surgery: Gradients flowing from the video branch are selectively detached on shallow audio layers, preventing interference while keeping full gradients through the cross-modal interaction layers (second sketch below).
- Region-wise loss reweighting: The policy optimization loss is reweighted toward regions that are critical for audio-video synchronization and fine-grained alignment (third sketch below).
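The summary does not give the exact advantage formulation, but the routing idea can be illustrated with a small sketch. The following assumes a group-relative (GRPO-style) normalization per modality; the function names, log-likelihood terms, and shapes are illustrative, not from the paper.

```python
import torch

def modality_wise_advantages(rewards_audio, rewards_video, eps=1e-8):
    """Group-relative advantages computed separately per modality.

    rewards_audio, rewards_video: (G,) rewards for a group of G rollouts from
    the same prompt. Normalizing each modality against its own group statistics
    keeps the two reward scales from mixing (the "inconsistency" obstacle above).
    """
    adv_a = (rewards_audio - rewards_audio.mean()) / (rewards_audio.std() + eps)
    adv_v = (rewards_video - rewards_video.mean()) / (rewards_video.std() + eps)
    return adv_a, adv_v

def routed_policy_loss(logp_audio, logp_video, adv_a, adv_v):
    """Route each modality's advantage only to its own generation branch."""
    loss_audio = -(adv_a.detach() * logp_audio).mean()
    loss_video = -(adv_v.detach() * logp_video).mean()
    return loss_audio + loss_video
```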
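Gradient surgery is described only at the level of which layers are affected. One way to realize it is sketched below, where shallow audio blocks receive video features through `.detach()` while deeper cross-modal blocks keep the gradient path; the block structure, depth threshold, and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AudioBlock(nn.Module):
    """Audio transformer block that cross-attends to video features.

    With detach_video=True (shallow layers), video features enter as constants,
    so no video-branch gradient flows back through this layer; deeper
    cross-modal interaction layers keep the full gradient path.
    """
    def __init__(self, dim, detach_video):
        super().__init__()
        self.detach_video = detach_video
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio, video):
        video_kv = video.detach() if self.detach_video else video
        audio = audio + self.self_attn(audio, audio, audio)[0]
        audio = audio + self.cross_attn(audio, video_kv, video_kv)[0]
        return audio + self.ff(audio)

# Shallow layers (here, the first 4 of 8) detach video gradients.
blocks = nn.ModuleList([AudioBlock(dim=256, detach_video=(i < 4)) for i in range(8)])
audio, video = torch.randn(1, 20, 256), torch.randn(1, 32, 256)
for block in blocks:
    audio = block(audio, video)
```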
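Similarly, how the synchronization-critical regions are identified is not specified in the summary; the sketch below shows only the reweighting step itself, assuming a per-token loss map and an externally supplied `sync_weight` score in [0, 1] (both hypothetical names).

```python
import torch

def region_weighted_loss(per_token_loss, sync_weight, alpha=2.0):
    """Reweight the per-token policy loss toward sync-critical regions.

    per_token_loss: (B, T) unweighted loss per temporal token.
    sync_weight:    (B, T) scores in [0, 1] marking regions judged critical for
                    audio-video synchronization (e.g. sound onsets); how these
                    scores are obtained is not covered here.
    alpha:          strength of the emphasis on critical regions.
    """
    weights = 1.0 + alpha * sync_weight                    # baseline 1, boosted where critical
    weights = weights / weights.mean(dim=1, keepdim=True)  # preserve the overall loss scale
    return (weights * per_token_loss).mean()

# Toy usage with random tensors.
print(region_weighted_loss(torch.rand(2, 16), torch.rand(2, 16)))
```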
The framework builds on the LTX-2 backbone and reports results on the JavisBench and VBench benchmarks; specific training compute details are not disclosed.
Results
OmniNFT reports improvements over baseline models in audio and video perceptual quality, cross-modal alignment, and synchronization, outperforming existing methods on both JavisBench and VBench. Specific numerical results and effect sizes are not detailed in the abstract, but the reported gains across all three axes indicate that the proposed components are effective.
Limitations
The authors acknowledge that their approach may still face challenges in scalability and generalization across diverse datasets and real-world scenarios. They do not address the computational overhead introduced by the added complexity of the proposed methods, and the reliance on a specific backbone (LTX-2) may limit how well the findings transfer to other architectures.
Why it matters
The implications of this work are significant for the field of multi-modal generation, particularly in applications requiring high fidelity and synchronization, such as virtual reality, gaming, and multimedia content creation. By addressing the shortcomings of existing RL approaches in this domain, OmniNFT paves the way for more effective and nuanced models that can handle the complexities of joint audio-video generation. This research could inspire further exploration into modality-aware frameworks and their applications in other multi-modal tasks.
Authors: Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu, Yuming Li, Zeyue Xue et al.
Source: arXiv:2605.12480
URL: https://arxiv.org/abs/2605.12480v1