Self-Prophetic Decoding to Unlock Visual Search in LVLMs
Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang
- Published
- May 27, 2026 — 17:01 UTC
Problem
This preprint addresses the limitations of Large Vision-Language Models (LVLMs) in performing visual search tasks, specifically focusing on the incompatibility of intrinsic capabilities post-training and the interference encountered during long multi-step reasoning. The authors highlight that existing LVLMs struggle to maintain coherent reasoning across extended contexts, which hampers their effectiveness in practical applications.
Method
The core technical contribution is the introduction of SeProD (Self-Prophetic Decoding), a framework that integrates two novel insights. First, it employs self-regulation between pre- and post-training LVLMs, utilizing the single-step capabilities of the pre-trained model to counteract capability deterioration and mitigate long-context interference. Second, it introduces probability-based prophetic sampling, which replaces traditional prompting methods. In this approach, the pre-trained model acts as a “prophet,” generating tokens that the post-trained model selectively accepts based on its output distribution. This mechanism allows for coherent multi-step reasoning without requiring additional training or computational resources. The authors do not disclose specific training compute details, but emphasize the framework’s plug-and-play nature.
Results
SeProD demonstrates significant improvements across multiple visual search LVLMs, achieving enhanced performance on all 12 splits of four visual search benchmarks, as well as on general Visual Question Answering (VQA) benchmarks. The results indicate consistent performance gains, with effect sizes suggesting improvements of up to 15% in accuracy compared to baseline models that do not utilize SeProD. Notably, the framework operates without added computational overhead, which is a critical advantage for deployment in resource-constrained environments.
Limitations
The authors acknowledge that while SeProD improves coherence in multi-step reasoning, it may still be limited by the inherent capabilities of the underlying LVLMs. Additionally, the framework’s reliance on the pre-trained model’s performance could lead to suboptimal results if the pre-trained model is not sufficiently robust. The authors do not discuss potential issues related to the scalability of the prophetic sampling mechanism or its performance in highly complex visual search scenarios.
Why it matters
The implications of this work are significant for the development of more effective LVLMs capable of complex reasoning tasks. By addressing the challenges of capability deterioration and long-context interference, SeProD paves the way for more reliable visual search applications in real-world scenarios. This framework could serve as a foundation for future research aimed at enhancing multimodal reasoning capabilities in LVLMs, potentially influencing a wide range of applications from autonomous systems to interactive AI agents.
Authors: Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang
Source: arXiv:2605.28741
URL: https://arxiv.org/abs/2605.28741v1
By Turing Wire editorial staff · May 27, 2026 · Editorial standards →
Source: arXiv cs.CV