Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Jie Zhang, Xiaoyue Chen, Anzhe Chen, Chenxu Lv, Deqing Li, Gengze Zhou
- Published
- Jun 15, 2026 — 17:52 UTC
Problem
This work addresses the gap in unified frameworks for embodied world modeling that leverage natural language for action representation. Existing models often lack the ability to predict future visual trajectories grounded in physical interactions across diverse robotic tasks. The authors present Qwen-RobotWorld as a preprint, indicating that it has not yet undergone peer review.
Method
Qwen-RobotWorld employs a three-part architecture:
- Double-Stream MMDiT with MLLM Action Encoding: This component utilizes a 60-layer double-stream diffusion transformer that integrates the frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention mechanisms. This design allows for effective coupling of language and visual data.
- Embodied World Knowledge (EWK): The model is trained on an extensive corpus comprising 8.6 million video-text pairs, encompassing over 200 million frames. This dataset includes action-language mappings across more than 20 embodiments and 500 action categories, providing a rich foundation for learning.
- General+Expert Progressive Curriculum: The training strategy is divided into two stages. The first stage focuses on learning general visual priors, while the second stage injects embodied specialization, all under a shared language interface. This progressive approach enhances the model’s adaptability to various tasks.
Results
Qwen-RobotWorld demonstrates superior performance across multiple benchmarks. It ranks first overall on both EWMBench and DreamGen Bench, outperforming all open-source models on WorldModelBench and PBench. The model also exhibits strong zero-shot generalization capabilities, as evidenced by its performance on the RoboTwin-IF benchmark, showcasing robust multi-view consistency. Specific performance metrics were not disclosed, but the authors emphasize the model’s competitiveness relative to existing state-of-the-art approaches.
Limitations
The authors acknowledge that while Qwen-RobotWorld shows promising results, it may still face challenges in real-world deployment, particularly in dynamic environments where unforeseen variables can affect performance. Additionally, the reliance on a large-scale dataset may limit the model’s applicability in scenarios with sparse data. The paper does not address potential biases in the training data or the implications of using a frozen language model, which may restrict adaptability to evolving language use.
Why it matters
The implications of Qwen-RobotWorld are significant for advancing embodied intelligence in robotics. By providing a unified framework that integrates language and visual prediction, this work opens avenues for synthetic data generation, scalable virtual environments for policy evaluation, and language-guided planning in robotic control. These advancements could enhance the efficiency and effectiveness of training robotic systems, as discussed in related literature on embodied AI and language integration, as published in arXiv cs.CV.
By Turing Wire editorial staff · Jun 15, 2026 · Editorial standards →
Source: arXiv cs.CV