SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu
- Published: May 12, 2026, 17:59 UTC
- Summary length: 447 words
- Relevance score: 80%
Problem
This paper addresses a gap in the literature on integrating understanding and generation within a single multimodal model. Existing large vision-language models (VLMs) typically treat the two as separate tasks, which leads to fragmented architectures and misaligned representation spaces. The authors argue that this separation is a structural limitation that impedes the development of native multimodal intelligence. The work is a preprint and has not yet undergone peer review.
Method
The authors introduce SenseNova-U1, a unified multimodal architecture built on the NEO-unify framework, which treats understanding and generation as synergistic aspects of a single process rather than separate tasks. Two variants are proposed: SenseNova-U1-8B-MoT, a dense model with 8 billion parameters, and SenseNova-U1-A3B-MoT, a mixture-of-experts model with about 30 billion total parameters, of which (per the usual A3B naming convention) roughly 3 billion are active per token (30B-A3B). The models are designed from first principles, with detailed strategies for model design, data preprocessing, pre-training, post-training, and inference. Training compute is not disclosed, but the architecture aims to rival leading understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence.
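The summary does not spell out how NEO-unify couples the two tasks, so the sketch below is only an illustrative reading of the general pattern it describes, not the authors' implementation: a single autoregressive backbone over a joint vocabulary of text tokens and discrete image tokens, so understanding (predicting text) and generation (predicting image tokens) share one forward pass and one representation space, plus a toy top-1 mixture-of-experts feed-forward to show why a 30B-A3B style model activates only a fraction of its parameters per token. All class names, dimensions, and the routing scheme are hypothetical choices.

```python
# Hypothetical sketch of a unified understanding/generation backbone.
# Not the paper's NEO-unify code; sizes and routing are illustrative only.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Toy mixture-of-experts FFN: each token is routed to its top-1 expert,
    so only a fraction of the FFN parameters is active per token."""
    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (batch, seq, dim)
        choice = self.router(x).argmax(dim=-1)  # top-1 expert id per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class UnifiedBlock(nn.Module):
    def __init__(self, dim, heads, moe=False):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = MoEFeedForward(dim, 4 * dim) if moe else nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, causal_mask):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + a
        return x + self.ffn(self.norm2(x))

class UnifiedLM(nn.Module):
    """One joint vocabulary for text tokens and quantized image tokens; a single
    autoregressive backbone predicts both, so the same forward pass serves
    understanding (text out) and generation (image tokens out)."""
    def __init__(self, text_vocab=32000, image_vocab=8192,
                 dim=512, depth=4, heads=8, moe=True):
        super().__init__()
        self.embed = nn.Embedding(text_vocab + image_vocab, dim)
        self.blocks = nn.ModuleList([UnifiedBlock(dim, heads, moe) for _ in range(depth)])
        self.head = nn.Linear(dim, text_vocab + image_vocab)

    def forward(self, tokens):                  # tokens: (batch, seq) joint-vocab ids
        x = self.embed(tokens)
        seq = tokens.size(1)
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        for block in self.blocks:
            x = block(x, mask)
        return self.head(x)                     # next-token logits over text + image vocab

# Usage: interleaved text/image token ids flow through one shared backbone.
model = UnifiedLM()
logits = model(torch.randint(0, 32000 + 8192, (2, 16)))
print(logits.shape)                             # torch.Size([2, 16, 40192])
```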
Results
SenseNova-U1 demonstrates competitive performance against top-tier understanding-only VLMs on multiple benchmarks. The models perform strongly on any-to-image (X2I) synthesis, complex infographic generation, and interleaved vision-language generation, with strong semantic consistency and visual fidelity. The abstract reports no specific numbers, but the authors claim that their models outperform existing baselines on both understanding and generation tasks. Preliminary evaluations also suggest that SenseNova-U1 extends to vision-language-action (VLA) and world model (WM) scenarios, indicating applicability beyond traditional multimodal tasks.
Limitations
The authors acknowledge that their work is still in the early stages, with preliminary evidence supporting the extended capabilities of their models. However, they do not provide extensive empirical validation across diverse datasets or real-world applications, which could limit the generalizability of their findings. Additionally, the computational efficiency and scalability of the mixture-of-experts model in practical applications remain unaddressed. The lack of peer review also raises questions about the robustness of their claims.
Why it matters
The introduction of SenseNova-U1 represents a paradigm shift in multimodal AI, moving away from the traditional separation of understanding and generation towards a unified framework. This approach has significant implications for future research, as it suggests that models can inherently develop capabilities across modalities without the need for explicit translation mechanisms. The findings could pave the way for more integrated systems that exhibit native multimodal intelligence, enhancing applications in areas such as robotics, interactive AI, and complex decision-making tasks.
Authors: Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye et al.
Source: arXiv:2605.12500
URL: https://arxiv.org/abs/2605.12500v1