Major agents robotics Alibaba

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu

Published: May 28, 2026 — 17:36 UTC

Problem
This paper addresses the fragmentation in embodied intelligence research, where existing models are typically tailored for specific tasks such as manipulation or navigation. This specialization limits generalization across diverse tasks, environments, and robot embodiments. The authors propose a unified framework, Qwen-VLA, to integrate these heterogeneous decision-making problems into a single vision-language-action model. This work is presented as a preprint and has not yet undergone peer review.

Method
Qwen-VLA builds upon the Qwen architecture, extending its vision-language capabilities to encompass continuous action and trajectory generation through a DiT-based action decoder. The model is trained using a large-scale joint pretraining approach that incorporates a variety of data sources, including:

Robotics manipulation trajectories
Human egocentric demonstrations
Synthetic simulation data
Vision-and-language navigation data
Trajectory-centric supervision
Auxiliary vision-language datasets

To accommodate multiple robot platforms, the authors introduce embodiment-aware prompt conditioning, which utilizes robot-specific textual descriptions to define the current embodiment and control conventions. The framework unifies manipulation, navigation, and trajectory prediction tasks, enabling transferable visual grounding, spatial reasoning, and continuous action generation across different robot morphologies and environments.

Results
Qwen-VLA demonstrates strong performance across various benchmarks, achieving the following results:

97.9% on LIBERO
73.7% on Simpler-WidowX
86.1% on RoboTwin-Easy and 87.2% on RoboTwin-Hard
69.0% Out-of-Sample Rate (OSR) on R2R
59.6% Success Rate (SR) on RxR
76.9% average out-of-distribution (OOD) success in real-world ALOHA experiments
26.6% zero-shot success on DOMINO dynamic manipulation

These results indicate consistent multi-task performance and robust generalization capabilities under varying conditions, such as scene layout, background, lighting, object configuration, and robot embodiment.

Limitations
The authors acknowledge that while Qwen-VLA shows promising results, it may still face challenges in extreme out-of-distribution scenarios and complex real-world environments that were not fully represented in the training data. Additionally, the reliance on large-scale pretraining may limit accessibility for researchers with fewer resources. The paper does not address potential biases in the training datasets or the implications of embodiment-aware prompt conditioning on model interpretability.

Why it matters
The development of Qwen-VLA has significant implications for advancing embodied AI by providing a unified framework that enhances generalization across tasks and environments. This model could facilitate the deployment of robotic systems capable of performing a wider range of tasks with minimal retraining, thereby accelerating progress in robotics and AI applications. The integration of vision, language, and action in a single model also opens avenues for future research in multi-modal learning and decision-making.

Authors: Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang et al.
Source: arXiv:2605.30280
URL: https://arxiv.org/abs/2605.30280v1

By Callan Zhang · May 28, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI