Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv
- Published: May 12, 2026, 16:41 UTC
- Summary length: 435 words
- Relevance score: 80%
Problem
This paper addresses a limitation of existing visual latent reasoning methods in multimodal large language models (MLLMs): unstable performance caused by a feature-space mismatch. Current approaches typically use an output-as-input paradigm, feeding decoder outputs back into the model, which can produce unreliable latent feedback. The authors point out that dominant visual-latent models are built on pre-norm MLLMs and reuse decoder hidden states whose norms differ substantially from those of the input embeddings, creating an alignment gap. This work is presented as a preprint and has not yet undergone peer review.
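The mismatch is easy to see in a toy setting. The sketch below is not from the paper; it is purely illustrative PyTorch. It stacks pre-norm blocks and compares the residual-stream norm against the input embedding norm: because pre-norm blocks add to the residual stream without renormalizing it, the final hidden state's norm grows far beyond the embedding scale, so reusing that state directly as a pseudo input embedding hands the model an out-of-distribution vector.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Minimal pre-norm transformer block: x + f(LayerNorm(x))."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        # The residual stream is never renormalized, so norms accumulate.
        return x + self.mlp(self.norm(x))

d, depth = 256, 24
emb = torch.randn(1, 8, d)  # stand-in for input embeddings
print(f"input embedding norm: {emb.norm(dim=-1).mean():.1f}")

x = emb
with torch.no_grad():
    for blk in [PreNormBlock(d) for _ in range(depth)]:
        x = blk(x)
print(f"hidden state norm after {depth} blocks: {x.norm(dim=-1).mean():.1f}")
# The final hidden state's norm typically far exceeds the embedding norm,
# which is the gap the output-as-input paradigm ignores.
```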
Method
The authors introduce the Granular Alignment Paradigm (GAP) for visual latent modeling, which aligns visual reasoning at three distinct levels:
- Feature-level alignment: A lightweight PCA-aligned latent head maps decoder outputs into input-compatible visual latents, so that the latent statistics match the input embedding space rather than the decoder's output distribution (see the sketch after this list).
- Context-level alignment: Latent targets are grounded with auxiliary visual supervision, making the alignment process interpretable and inspectable.
- Capacity-guided alignment: Latent supervision is selectively assigned to examples on which the base MLLM struggles, concentrating the training signal where the model has headroom to improve (a selection sketch follows the paragraph below).
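The abstract does not specify the head's architecture. Below is a minimal sketch of one plausible reading of a "PCA-aligned latent head", assuming the PCA basis is fit offline on a sample of input-side visual embeddings and decoder states are projected into that subspace and rescaled to input norms; the class and parameter names are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PCAAlignedLatentHead(nn.Module):
    """Hypothetical sketch: project decoder hidden states into the principal
    subspace of the input visual embeddings, then rescale to embedding norms.
    Not the paper's implementation; the abstract does not specify the head."""
    def __init__(self, input_embeds: torch.Tensor, hidden_dim: int, k: int = 64):
        super().__init__()
        # Fit PCA offline on a sample of input-side visual embeddings.
        mu = input_embeds.mean(dim=0)
        _, _, v = torch.pca_lowrank(input_embeds - mu, q=k)
        self.register_buffer("mu", mu)      # (d,) embedding mean
        self.register_buffer("basis", v)    # (d, k) principal directions
        self.register_buffer("scale", (input_embeds - mu).norm(dim=-1).mean())
        self.proj = nn.Linear(hidden_dim, k)  # lightweight trainable map

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        coords = self.proj(hidden)                  # (..., k) PCA coordinates
        latent = coords @ self.basis.T              # back to embedding space
        latent = latent / latent.norm(dim=-1, keepdim=True) * self.scale
        return latent + self.mu                     # input-compatible latent

# Usage with toy stand-ins for the embedding/hidden dimensions:
embeds = torch.randn(1000, 512)        # sample of input visual embeddings
head = PCAAlignedLatentHead(embeds, hidden_dim=512, k=64)
latents = head(torch.randn(4, 512))    # decoder states -> input-space latents
```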
The model is built on the Qwen2.5-VL 7B architecture; the abstract does not disclose the loss function, training compute, or dataset.
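The capacity-guided step is likewise underspecified. A generic realization of "supervise where the base model struggles" is a pass-rate filter over training examples, sketched here under that assumption; the function and its scoring interface are hypothetical, not from the paper.

```python
from typing import Callable, Iterable

def select_for_latent_supervision(
    examples: Iterable[dict],
    base_model_correct: Callable[[dict], bool],
    n_attempts: int = 4,
    max_pass_rate: float = 0.5,
) -> list[dict]:
    """Hypothetical capacity-guided filter: keep only examples the base MLLM
    fails on most of the time, and route latent supervision to those."""
    selected = []
    for ex in examples:
        passes = sum(base_model_correct(ex) for _ in range(n_attempts))
        if passes / n_attempts <= max_pass_rate:  # base model struggles here
            selected.append(ex)
    return selected
```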
Results
GAP achieves the best mean aggregate score across perception and reasoning tasks among the supervised variants tested, although the abstract does not report specific numbers. Inference-time intervention probing indicates that the generated latents contribute meaningful, task-relevant visual signal rather than merely serving as additional token slots.
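The probing protocol is described only at a high level. A generic intervention probe along those lines, sketched under assumptions (the `accuracy_fn` interface is hypothetical, not the paper's API), keeps, zeroes, or shuffles the generated latents and compares task accuracy: latents carrying real visual signal should degrade accuracy under ablation, while mere token slots should not.

```python
import torch

def intervention_probe(accuracy_fn, batch, latents: torch.Tensor) -> dict:
    """Hypothetical probe. accuracy_fn(batch, latents) -> float is an
    assumed interface. If the generated latents carry real visual signal,
    ablating them should lower accuracy; if they are mere extra token
    slots, accuracy should be roughly unchanged."""
    with torch.no_grad():
        return {
            "kept": accuracy_fn(batch, latents),
            "zeroed": accuracy_fn(batch, torch.zeros_like(latents)),
            "shuffled": accuracy_fn(batch, latents[torch.randperm(latents.size(0))]),
        }
```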
Limitations
The authors acknowledge that their approach may still be sensitive to the choice of auxiliary visual supervision and the effectiveness of the PCA alignment. They do not discuss potential scalability issues or the generalizability of the model across different multimodal tasks. Additionally, the reliance on a specific architecture (Qwen2.5-VL 7B) may limit the applicability of the findings to other MLLM frameworks.
Why it matters
The introduction of the GAP paradigm has significant implications for the development of more stable and interpretable visual reasoning capabilities in MLLMs. By addressing the feature-space mismatch and enhancing alignment across multiple levels, this work paves the way for improved performance in tasks requiring visual understanding and reasoning. The findings could influence future research on multimodal integration, particularly in enhancing the robustness of visual-latent models and their applications in real-world scenarios.
Authors: Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang et al.
Source: arXiv:2605.12374
URL: https://arxiv.org/abs/2605.12374v1