NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation
Daichi Azuma, Taiki Miyanishi, Koya Sakamoto, Shuhei Kurita, Yaonan Zhu, Petr Khrapchenkov
- Published
- Jun 11, 2026 — 15:44 UTC
Problem
The paper addresses the limitations of existing navigation world models, which typically serve as prediction modules requiring external planners for closed-loop control. This gap is particularly evident in goal-conditioned visual navigation tasks, where robots must operate under partial observability. The authors propose a novel approach, NavWAM, to directly convert visual foresight into executable actions, thereby eliminating the need for separate planning mechanisms. This work is presented as a preprint, indicating it has not yet undergone peer review.
Method
NavWAM employs a diffusion-transformer architecture that integrates future observations, goal-progress values, and action chunks into a shared latent sequence. The model is trained to predict future states while simultaneously learning the action and value targets that dictate closed-loop behavior. This joint learning framework allows for the direct application of visual foresight in robot control. The training process involves simulation pretraining followed by real-robot adaptation, ensuring that the model can generalize from simulated environments to real-world scenarios. Specifics regarding the training compute and dataset used are not disclosed in the paper.
Results
NavWAM was evaluated on image-goal navigation tasks, demonstrating significant improvements over planning-based world models and a representative direct navigation policy. In closed-loop real-robot deployment, NavWAM achieved a success rate of 85% compared to 75% for the best-performing planning-based baseline. Additionally, in offline benchmarks, NavWAM outperformed the direct navigation policy by a margin of 10% in success rates across various scenarios. These results indicate a substantial enhancement in navigation performance, showcasing the effectiveness of integrating visual foresight into action execution.
Limitations
The authors acknowledge that while NavWAM shows promising results, it may still struggle in highly dynamic environments where rapid changes in the surroundings could affect the accuracy of predictions. Additionally, the reliance on simulation pretraining may introduce a domain gap when transitioning to real-world applications, which could affect performance. The paper does not address potential computational overhead associated with the diffusion-transformer architecture, which may limit scalability in resource-constrained settings.
Why it matters
The introduction of NavWAM represents a significant advancement in the field of goal-conditioned visual navigation, as it allows robots to leverage visual foresight directly for action execution. This approach could pave the way for more autonomous robotic systems capable of navigating complex environments without the need for extensive planning. The implications of this work extend to various applications, including autonomous vehicles and robotic assistants, where real-time decision-making is crucial. The findings contribute to the ongoing discourse on integrating predictive models with action policies in robotics, as discussed in related literature, and are available on arXiv.
By Turing Wire editorial staff · Jun 11, 2026 · Editorial standards →
Source: arXiv cs.CV