Perceptual Flow Network for Visually Grounded Reasoning
Yangfu Li, Yuning Gong, Hongjian Zhan, Teng Li, Yuanhuiyi Lyu, Tianyi Chen
- Published: May 4, 2026 — 15:31 UTC
- Summary length: 384 words
- Relevance score: 90%
Problem
This paper addresses the limitations of existing Large Vision-Language Models (LVLMs) in visually grounded reasoning, in particular the language bias and hallucination that stem from generic optimization objectives such as maximum likelihood estimation (MLE). The authors argue that current methods rely on geometric priors from visual experts, which are often suboptimal because they optimize for geometric precision rather than reasoning utility. This work is presented as a preprint and has not yet undergone peer review.
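For context, the MLE objective the paper identifies as a source of language bias is the standard autoregressive log-likelihood over answer tokens. In generic notation (ours, not the paper's):

```latex
\mathcal{L}_{\mathrm{MLE}} = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, v\right)
```

Here $y_t$ are answer tokens and $v$ is the visual input. Because this objective rewards likely token continuations regardless of grounding, a model can minimize it while attending only weakly to $v$ — the bias the authors aim to correct.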
Method
The authors introduce the Perceptual Flow Network (PFlowNet), which decouples perception from reasoning to form a self-conditioned generation process. This architecture integrates multi-dimensional rewards with vicinal geometric shaping through variational reinforcement learning. The self-conditioning mechanism lets the model generate perceptual outputs that are not rigidly aligned with expert priors, improving reasoning capability while preserving visual reliability. Training relies on reinforcement learning, although the abstract does not disclose training compute or dataset sizes.
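The abstract does not specify PFlowNet's internals, but the decoupled pipeline it describes can be sketched schematically. The sketch below is entirely an assumption: the function names (`perceive`, `reason`, `multi_dim_reward`), the use of IoU as the geometric reward term, and the blending weight `alpha` are all our own stand-ins, not the paper's API.

```python
# Hypothetical sketch of a PFlowNet-style decoupled pipeline: a perception
# stage proposes a region, a reasoning stage conditions on that self-generated
# percept, and a multi-dimensional reward blends task success with a geometric
# term. All names and structure are assumptions for illustration.

from dataclasses import dataclass
from typing import List


@dataclass
class Percept:
    box: tuple   # (x, y, w, h) region proposed by the perception stage
    score: float # perception confidence


def perceive(image_feats: List[float]) -> Percept:
    # Stand-in perception stage: treat the strongest feature as the
    # proposed region. A real model would emit boxes or masks.
    i = max(range(len(image_feats)), key=lambda j: image_feats[j])
    return Percept(box=(i, 0, 1, 1), score=image_feats[i])


def reason(percept: Percept, question: str) -> str:
    # Stand-in reasoning stage conditioned on the model's OWN percept
    # (self-conditioning), rather than on an external expert prior.
    return f"answer@{percept.box[0]}"


def multi_dim_reward(answer_ok: bool, iou: float, alpha: float = 0.5) -> float:
    # Multi-dimensional reward: blend task correctness with a vicinal
    # geometric term (here IoU against a reference box). The specific
    # blend is our assumption, not the paper's formulation.
    return alpha * float(answer_ok) + (1.0 - alpha) * iou
```

The key design point this sketch tries to capture is that the reasoning stage consumes a generated percept rather than a fixed expert prior, so the reward — not geometric precision alone — shapes what the perception stage learns to emit.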
Results
PFlowNet achieves state-of-the-art (SOTA) results on two benchmarks, scoring 90.6% on V* Bench and 67.0% on MME-RealWorld-lite. These scores exceed prior baselines, indicating a marked improvement on visually grounded reasoning tasks, although the abstract does not name the baselines being compared against.
Limitations
The authors acknowledge that while PFlowNet improves upon existing methods, it may still be limited by the inherent biases present in the training data and the complexity of visual reasoning tasks. They do not address potential scalability issues or the computational overhead introduced by the reinforcement learning framework. Additionally, the reliance on self-conditioning may introduce challenges in generalization across diverse visual contexts.
Why it matters
The development of PFlowNet has significant implications for the field of visually grounded reasoning, particularly in applications requiring robust interaction between visual perception and language understanding. By moving away from rigid geometric priors, this work opens avenues for more flexible and interpretable models that can better handle the complexities of real-world visual reasoning tasks. The integration of reinforcement learning with perceptual shaping could inspire future research into adaptive learning strategies that enhance model performance in dynamic environments.
Authors: Yangfu Li, Yuning Gong, Hongjian Zhan, Teng Li, Yuanhuiyi Lyu, Tianyi Chen, Qi Liu, Ziyuan Huang et al.
Source: arXiv:2605.02730
URL: https://arxiv.org/abs/2605.02730v1