Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang
- Published
- Jun 16, 2026 — 17:59 UTC
Problem
The paper addresses the limitations of existing unified multimodal models that typically utilize separate visual tokenizers, which fragment the representation space and impede seamless integration of visual understanding and generation. This work is particularly relevant as it presents a preprint, indicating that it has not yet undergone peer review.
Method
The authors propose UniAR, a unified autoregressive framework that employs a single discrete visual tokenizer to bridge the gap between visual understanding and generation. The architecture incorporates a pretrained vision encoder that utilizes multi-level feature fusion and a lookup-free bitwise quantization scheme. This design preserves both high-level semantics and low-level details while effectively scaling the visual vocabulary with minimal computational overhead. The model employs parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, which significantly reduces the visual sequence length and accelerates the generation process. A diffusion-based visual decoder is then used to decode the discrete visual tokens into high-fidelity images. The training regimen includes large-scale pre-training, followed by supervised fine-tuning and reinforcement learning to optimize performance.
Results
UniAR achieves state-of-the-art performance in image generation and editing tasks, outperforming established baselines on relevant benchmarks. Specific performance metrics include a notable improvement in image quality and coherence compared to models like DALL-E and CLIP, although exact numerical results are not disclosed in the abstract. The model also maintains competitive performance on multimodal understanding tasks, indicating its versatility across different applications.
Limitations
The authors acknowledge that while UniAR demonstrates significant advancements, it may still face challenges in scalability and efficiency when applied to extremely large datasets or in real-time applications. Additionally, the reliance on a single visual tokenizer may limit the model’s adaptability to diverse visual contexts. The paper does not address potential biases in the training data or the implications of using a single tokenizer on the model’s generalization capabilities.
Why it matters
The introduction of UniAR represents a significant step towards truly unified multimodal models, facilitating more coherent interactions between visual understanding and generation. This work has implications for various downstream applications, including image synthesis, editing, and multimodal AI systems. The approach of using a single visual tokenizer could inspire future research to explore more integrated architectures in multimodal learning, as discussed in related works on unified models (as published in arXiv cs.CV). The project page provides additional resources and insights into the implementation and performance of UniAR.
By Callan Zhang · Jun 16, 2026 · Editorial standards →
Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.
Source: arXiv cs.CV