Notable multimodal

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Mahtab Bigverdi, Lindsey Li, Weikai Huang, Yiming Liu, Jaemin Cho, Jieyu Zhang

Published
Jun 2, 2026 — 17:59 UTC

Problem
Current vision-language models (VLMs) exhibit limitations in spatial reasoning, particularly when critical information is not directly observable. This paper addresses the gap in the literature regarding imaginative perception, which involves inferring unseen viewpoints, tracing paths through occluded spaces, and integrating partial observations into coherent spatial representations. The authors present a preprint work that proposes a novel approach to enhance VLMs’ capabilities in these areas.

Method
The authors introduce Imaginative Perception Tokens (IPT), which serve as intermediate perceptual representations that externalize the potential perceptions of a VLM under various spatial configurations while remaining consistent with the observed input. The study formulates three specific tasks: Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC). To evaluate the effectiveness of IPT, the authors constructed datasets comprising approximately 20,000 examples, complete with ground truth imaginations and evaluation benchmarks. The backbone model utilized for experimentation is the unified VLM known as BAGEL. The IPT supervision is shown to consistently enhance spatial reasoning capabilities, outperforming traditional textual chain of thought training methods, even without generating images during inference.

Results
The implementation of IPT supervision resulted in a notable improvement in accuracy on the MVC task, achieving a 3.4% increase compared to baseline models. Additionally, the IPT approach demonstrated competitive performance on the PT task against strong closed-source models. The authors also discovered that combining IPT with label-only supervision yielded further performance gains, while reliance on textual chain of thought training often led to performance degradation, indicating a potential modality mismatch when spatial reasoning is constrained to language.

Limitations
The authors acknowledge that their approach may not generalize to all spatial reasoning tasks, particularly those requiring complex visual synthesis beyond the scope of the proposed IPT framework. They also note that the reliance on a specific backbone model (BAGEL) may limit the applicability of their findings to other architectures. Furthermore, the study does not explore the computational efficiency of IPT in real-time applications, which could be a critical factor for deployment in practical scenarios.

Why it matters
The introduction of IPT represents a significant advancement in the field of spatial reasoning within VLMs, providing a structured method for reasoning about unobserved spatial structures. This work has implications for various downstream applications, including robotics, augmented reality, and autonomous navigation, where understanding spatial relationships is crucial. The findings suggest that enhancing VLMs with imaginative perception capabilities can lead to better generalization and interpretability in spatial reasoning tasks, as discussed in the context of the broader literature on multimodal learning, as published in arXiv cs.AI.

Turing Wire

By Turing Wire editorial staff · Jun 2, 2026 · Editorial standards →

Source: arXiv cs.AI