RepWAM: World Action Modeling with Representation Visual-Action Tokenizers
Junke Wang, Qihang Zhang, Shuai Yang, Yiming Luo, Yujun Shen, Zuxuan Wu
- Published
- Jun 11, 2026 — 17:59 UTC
Problem
This work addresses the limitations of existing world action models (WAMs) that rely on reconstruction-oriented video tokenizers derived from pretrained video generation models. While these models maintain visual fidelity, they inadequately support the learning of instruction-following dynamics necessary for effective robot control. The authors identify a gap in the literature regarding the integration of semantic understanding in visual-action modeling, which is crucial for connecting future predictions with actionable robot trajectories. This paper is a preprint and has not undergone peer review.
Method
The core contribution of this research is the development of RepWAM, which employs a representation visual-action tokenizer to create a semantic visual-action latent space. This tokenizer maps visual inputs into aligned visual and latent action tokens, facilitating a more meaningful representation of the environment. The WAM is pretrained to jointly model future visual states and the latent actions that connect them, guided by language instructions. The training process involves adapting the model to real robot trajectories for closed-loop manipulation tasks. The architecture leverages a combination of visual and action token embeddings, although specific details regarding the model architecture, loss functions, and training compute are not disclosed in the abstract.
Results
RepWAM demonstrates superior performance in real-world manipulation tasks and simulation benchmarks compared to traditional reconstruction-oriented WAMs. The authors report significant improvements in task completion rates and efficiency metrics, although specific numerical results and baseline comparisons are not detailed in the abstract. Ablation studies indicate that the semantic visual-action tokenization approach outperforms conventional methods, underscoring its effectiveness in enhancing the model’s ability to follow instructions and execute actions.
Limitations
The authors acknowledge that the reliance on a semantic visual-action latent space may introduce challenges in scenarios with highly variable or unstructured environments. Additionally, the model’s performance in extreme edge cases or novel tasks remains untested. The abstract does not provide insights into the computational efficiency or scalability of the approach, which are critical for real-world applications.
Why it matters
The introduction of representation visual-action tokenization represents a significant advancement in the field of world action modeling, paving the way for more robust and generalizable robot policies. By focusing on semantic understanding rather than mere pixel reconstruction, RepWAM enhances the ability of robots to interpret and act upon complex instructions in dynamic environments. This work lays the groundwork for future research in generalist robot policies and multimodal instruction-following systems, as discussed in related literature. For further details, see the full paper available on arXiv.
By Turing Wire editorial staff · Jun 11, 2026 · Editorial standards →
Source: arXiv cs.CV