Notable reasoning

What is Holding Back Latent Visual Reasoning?

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

Published: May 18, 2026 — 14:14 UTC

Problem
This preprint addresses a significant gap in the understanding of latent visual reasoning within Vision-Language Models (VLMs). While recent research has explored the use of continuous latent tokens for chain-of-thought reasoning, the authors find that these tokens do not contribute meaningfully to model accuracy. This raises questions about the causal role of latent tokens in predictions and highlights a lack of effective utilization of these tokens in existing datasets.

Method
The authors conduct a series of experiments to analyze the impact of latent tokens on model performance. They replace informative latent tokens with uninformative “dummy” tokens and observe that model accuracy remains unchanged, indicating that the latent tokens do not play a significant role in the final predictions. They investigate the training signal from oracle latent representations and the quality of latent tokens generated during inference. The study emphasizes the need for high-quality datasets that provide informative intermediate steps and the importance of improving the precision of latent token predictions. The authors also introduce a diagnostic dataset where latent tokens are designed to offer substantial support for predictions, demonstrating that models can effectively utilize these tokens when they are informative.

Results
The findings reveal that existing datasets often do not provide sufficient information through oracle latent tokens, leading models to ignore them during training. When fine-tuned on the diagnostic dataset, models show a marked improvement in their ability to rely on latent tokens for predictions. The authors do not provide specific numerical results or effect sizes against named baselines, but they emphasize the qualitative improvements observed when models are trained with informative latent tokens.

Limitations
The authors acknowledge that the reliance on oracle latent representations is limited by the quality of existing datasets, which often do not facilitate effective learning. They also note that the latent tokens generated at inference time tend to collapse to a narrow region, which diminishes their utility. An additional limitation not explicitly mentioned is the potential overfitting to the diagnostic dataset, which may not generalize to broader applications. Furthermore, the study does not explore the implications of varying model architectures on the effectiveness of latent tokens.

Why it matters
This work has significant implications for the future development of VLMs and latent visual reasoning. By identifying the shortcomings in current datasets and the need for more informative intermediate representations, the authors provide a roadmap for enhancing model performance in visual reasoning tasks. The emphasis on high-quality datasets and precise latent token prediction could lead to breakthroughs in how models understand and simulate complex visual scenarios, ultimately improving their reasoning capabilities in real-world applications.

Authors: André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann
Source: arXiv:2605.18445
URL: https://arxiv.org/abs/2605.18445v1

By Callan Zhang · May 18, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL