AHA-WAM:Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang
- Published
- Jun 8, 2026 — 17:55 UTC
Problem
The paper addresses the limitations of existing world-action models in robot manipulation, which couple world prediction and action execution at the same temporal resolution. This coupling leads to inefficiencies, as the world branch must model near-term frame variations that are often redundant and provide limited information. The authors propose a solution to this problem by introducing a preprint work, AHA-WAM, which leverages temporal asymmetry to improve the modeling of visual scene dynamics and actions.
Method
AHA-WAM employs a dual Diffusion Transformer (DiT) architecture, consisting of a low-frequency world planner and a high-frequency action executor. The world planner utilizes a rolling key-value memory to maintain past observations and encodes long-horizon scene evolution through reusable layerwise latent context. In contrast, the action executor operates in a closed-loop manner, executing short action chunks while querying the context via layerwise joint attention. The model incorporates horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR) to facilitate asynchronous execution, allowing the action expert to leverage long-horizon world context while remaining responsive to real-time execution states without the need to rerun the video DiT.
Results
AHA-WAM demonstrates state-of-the-art performance on the RoboTwin benchmark, achieving an average success rate of 92.80%. In real-world manipulation tasks, it attains a success rate of 78.3% across four tasks. Notably, AHA-WAM achieves a closed-loop control frequency of 24.17 Hz, representing a 4.59x speedup over the Fast-WAM baseline. These results indicate significant improvements in both efficiency and effectiveness in robot manipulation tasks.
Limitations
The authors acknowledge that the proposed model is a preprint and has not undergone peer review, which may affect the robustness of the findings. Additionally, while the model shows impressive performance, it may still be sensitive to the quality of the input data and the specific configurations of the training environment. The reliance on a dual architecture may also introduce complexity in deployment and tuning for specific applications.
Why it matters
The introduction of AHA-WAM has significant implications for the field of robot manipulation, particularly in enhancing the efficiency and effectiveness of world-action models. By decoupling world prediction from action execution, this work opens avenues for more sophisticated and responsive robotic systems capable of handling complex tasks in dynamic environments. The findings contribute to the ongoing discourse on improving embodied control in robotics, as published in arXiv.
By Turing Wire editorial staff · Jun 8, 2026 · Editorial standards →
Source: arXiv cs.AI