AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao
- Published: May 12, 2026, 17:59 UTC
- Summary length: 459 words
- Relevance score: 80%
Problem
This paper addresses a limitation of existing unified multimodal models (UMMs): they struggle to perform advanced reasoning during generation unless they are first taken through a cold-start training phase. The work focuses on improving UMMs' text-to-image generation and self-reflective refinement. The authors propose AlphaGRPO, a framework that applies Group Relative Policy Optimization (GRPO) to UMMs so that their outputs align more closely with user intent. The work is presented as a preprint and has not yet undergone peer review.
Method
AlphaGRPO introduces a novel reward mechanism termed Decompositional Verifiable Reward (DVReward). DVReward decomposes a complex user request into atomic, verifiable semantic and quality questions, each of which is answered by a general multimodal large language model (MLLM); the resulting judgments provide stable, fine-grained supervision for real-world multimodal generation tasks. GRPO then optimizes the UMM's policy against this reward, strengthening its reasoning without a separate cold-start training phase (training compute details are not disclosed). A self-reflective refinement loop lets the model autonomously identify and correct misalignments in its outputs, improving the fidelity of the generated content. A sketch of how these pieces might fit together follows.
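A minimal sketch, under illustrative assumptions: the abstract does not specify the decomposition procedure, the verifier interface, or the question weighting, so the `decompose` function, the `verify` callable (returning a score in [0, 1]), the equal-weight averaging, and the group size G below are all hypothetical stand-ins, not the paper's formulation. The code shows the general shape of DVReward scoring followed by GRPO's group-relative advantage computation.

```python
import math
from typing import Callable, List

def decompose(prompt: str) -> List[str]:
    """Toy decomposer. In AlphaGRPO this step breaks the user request
    into atomic, verifiable checks; here the questions are hard-coded
    for one prompt purely for illustration."""
    return [
        "Is there a red cube in the image?",       # semantic check
        "Is the cube to the left of the sphere?",  # semantic check
        "Is the image free of visual artifacts?",  # quality check
    ]

def dv_reward(
    prompt: str,
    image: object,
    decompose_fn: Callable[[str], List[str]],
    verify: Callable[[str, object], float],
) -> float:
    """Decompositional Verifiable Reward, sketched: each atomic question
    is put to a verifier MLLM, which returns a score in [0, 1] (e.g. its
    probability of answering "yes"). The reward here is the unweighted
    mean over all checks; the paper may weight semantic and quality
    questions differently."""
    questions = decompose_fn(prompt)
    scores = [verify(q, image) for q in questions]
    return sum(scores) / max(len(scores), 1)

def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages as in GRPO: standardize rewards within
    the group of G samples drawn for the same prompt, so no learned
    value critic is required."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    def fake_verify(question: str, image: object) -> float:
        # Deterministic stand-in for a real MLLM verifier call.
        return 0.5 + 0.5 * ((hash((question, image)) % 100) / 100)

    prompt = "a red cube to the left of a sphere"
    images = [f"sample_{i}" for i in range(4)]  # G = 4 rollouts per prompt
    rewards = [dv_reward(prompt, im, decompose, fake_verify) for im in images]
    print(grpo_advantages(rewards))
```

The group-relative normalization is the key GRPO ingredient: advantages are measured against the group's own mean reward rather than against a learned value critic, and these advantages would then feed GRPO's clipped policy-gradient objective. The per-question verifier scores also make the reward auditable check by check.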
Results
AlphaGRPO demonstrates significant performance improvements across several multimodal generation benchmarks. On GenEval, TIIF-Bench, DPG-Bench, and WISE, the framework outperforms existing baselines in both generation and editing tasks. Notably, it posts substantial gains on GEdit without any editing-specific training. Exact numbers are not given in the abstract, but the paper reports extensive experiments validating the self-reflective reinforcement approach and a marked increase in the model's ability to generate high-fidelity outputs.
Limitations
The authors acknowledge that relying on a general MLLM to answer DVReward's verification questions may introduce biases inherited from that MLLM's training data, and that performance in highly complex or nuanced scenarios remains to be fully explored. The paper does not address potential scalability issues or the computational overhead of the decompositional reward, which requires scoring every generated sample question by question. Finally, as the work has not yet been peer reviewed, the findings should be interpreted with caution until validated by the community.
Why it matters
This work matters for multimodal AI because it strengthens the reasoning capabilities of models that unify text and image generation. By decomposing the reward into verifiable checks, AlphaGRPO points toward more interpretable and reliable multimodal systems, which could benefit applications that require nuanced understanding and generation, such as interactive AI systems, content creation, and automated editing tools. The framework's self-reflective component also suggests a new direction for models that autonomously improve their own outputs, a capability crucial for real-world deployment.
Source: arXiv cs.AI
URL: https://arxiv.org/abs/2605.12495v1