Notable other

DreamX-World 1.0: A General-Purpose Interactive World Model

DreamX Team, Yancheng Bai, Rui Chen, Xiangxiang Chu, Rujing Dang, Hao Dou

Published
Jun 15, 2026 — 17:30 UTC

Problem
This paper addresses the limitations of existing interactive world models in generating long-horizon video content with controllable camera navigation and event prompts. Prior models often lack the ability to revisit previously observed regions and exhibit significant style and color drift over extended sequences. The work is presented as a preprint, indicating that it has not yet undergone peer review.

Method
The authors propose DreamX-World 1.0, a general-purpose interactive model that integrates text/image inputs to generate videos across various domains, including photorealistic and stylized environments. The architecture employs a lightweight variant of projective positional encoding, termed E-PRoPE, which maintains projective camera geometry while utilizing camera-aware attention on spatially reduced tokens. The model transforms a bidirectional video generator into an autoregressive world model through techniques such as causal forcing, DMD-style distillation, and long-rollout training. This approach allows the model to learn from self-generated long-horizon contexts, mitigating style and color drift. Key innovations include Memory-Conditioned Scene Persistence for retrieving earlier views based on camera geometry and residual recycling to enhance conditioning robustness. Event Instruction Tuning facilitates composable event control, while reinforcement learning alignment is employed to recover visual quality and camera control post-distillation. The model is optimized for performance using mixed-precision DiT execution, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, achieving up to 16 FPS on eight RTX 5090 GPUs.

Results
DreamX-World 1.0 demonstrates superior performance on a 5-second basic evaluation, achieving a camera-control score of 73.75 and an overall score of 84.76. This performance surpasses that of the baseline models HY-WorldPlay 1.5 and LingBot-World, which scored 80.79 and 80.45, respectively. The results indicate a significant improvement in both camera control and overall video generation quality, showcasing the effectiveness of the proposed methods.

Limitations
The authors acknowledge that the model’s reliance on self-generated contexts may introduce biases based on the initial training data. Additionally, while the model achieves high FPS rates, the computational requirements are substantial, necessitating multiple high-end GPUs for optimal performance. The paper does not address potential limitations in generalization across highly diverse or complex environments, which could affect the model’s applicability in real-world scenarios.

Why it matters
The advancements presented in DreamX-World 1.0 have significant implications for the development of interactive video generation systems, particularly in applications requiring dynamic scene manipulation and event-driven narratives. The integration of camera control and event instruction tuning could enhance user experience in gaming, virtual reality, and simulation environments. This work contributes to the ongoing discourse in the field of generative models and interactive AI, as published in arXiv.

Turing Wire

By Turing Wire editorial staff · Jun 15, 2026 · Editorial standards →

Source: arXiv cs.CV