
Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Jingcheng Deng, Zihao Wei, Liang Pang, Junhong Wu, Shicheng Xu, Zenghao Duan, Huawei Shen

Original source

arXiv cs.CL

https://arxiv.org/abs/2604.27998v1

Problem
This paper addresses the instability of reinforcement learning (RL) in latent spaces, a problem that remains largely unexplored in the context of latent reasoning: existing latent-reasoning methods rely almost entirely on supervised learning. The authors identify three critical bottlenecks: the absence of intrinsic latent manifolds, exploration-optimization misalignment, and latent mixture non-closure. The work is a preprint and has not yet undergone peer review.

Method
The authors propose Latent-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) tailored to latent reasoning. The method introduces three key components:

  1. Invalid-sample advantage masking: excludes invalid rollouts from the group statistics and zeroes their advantages so they cannot corrupt policy updates (a minimal sketch follows this list).
  2. One-sided noise sampling: injects exploration noise in a controlled, one-sided manner so that perturbed latents do not stray off the valid latent manifold (sketched after the training note below).
  3. Optimal correct-path first-token selection: prioritizes reinforcing the first token of the best correct path in the latent space, keeping updates aligned with valid trajectories (also covered in the sketch below).
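
The summary does not reproduce the authors' formulation, but a minimal sketch of how items 1 and 3 could plug into GRPO's group-relative advantages is given below. The function names, the zero-advantage masking choice, and the reading of "first token" as the first step of the highest-reward valid rollout are assumptions of this sketch, not the paper's code.

```python
import statistics

def masked_group_advantages(rewards, valid, eps=1e-6):
    """GRPO-style group-relative advantages with invalid samples masked.

    rewards: scalar rewards for one prompt's group of rollouts.
    valid:   flags marking rollouts that stayed on the valid latent
             manifold; invalid rollouts are excluded from the group
             statistics and receive zero advantage, so they contribute
             no policy gradient (one reading of item 1 above).
    """
    kept = [r for r, v in zip(rewards, valid) if v]
    if len(kept) < 2:  # too few valid samples to normalize meaningfully
        return [0.0] * len(rewards)
    mu = statistics.mean(kept)
    sigma = statistics.pstdev(kept)
    return [(r - mu) / (sigma + eps) if v else 0.0
            for r, v in zip(rewards, valid)]

def best_correct_first_step(rollouts, rewards, valid):
    """A loose reading of item 3: return the first latent step of the
    highest-reward valid rollout, so that step can be preferentially
    reinforced (hypothetical helper, not the authors' mechanism)."""
    candidates = [(r, roll) for roll, r, v
                  in zip(rollouts, rewards, valid) if v and roll]
    if not candidates:
        return None
    _, best = max(candidates, key=lambda t: t[0])
    return best[0]

# Example: a group of 4 rollouts, one of which left the manifold.
rewards = [1.0, 0.0, 1.0, 0.0]
valid = [True, True, False, True]
print(masked_group_advantages(rewards, valid))  # masked rollout -> 0.0
```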

Training combines these strategies to stabilize learning in latent spaces, yielding reasoning chains 3-4 times shorter than those produced by explicit GRPO.
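
The mechanics of the one-sided noise are not detailed in this summary. One plausible reading, sketched below under stated assumptions, draws a half-normal (hence one-sided) step toward a reference point assumed to lie on the valid latent manifold, so exploration never pushes a latent further off the manifold. The `anchor` argument and the half-normal choice are illustrative, not the authors' design.

```python
import numpy as np

def one_sided_noise_sample(z, anchor, scale=0.1, rng=None):
    """Perturb latent z only *toward* an anchor inside the valid region.

    The step magnitude is half-normal (|N(0, scale)|), so the noise is
    one-sided: it can move z toward the manifold but never away from it.
    `anchor` stands in for any point known to lie on the valid manifold
    (an assumption of this sketch).
    """
    rng = rng or np.random.default_rng()
    direction = anchor - z
    norm = np.linalg.norm(direction)
    if norm < 1e-8:  # already at the anchor: nothing to do
        return z
    step = abs(rng.normal(0.0, scale))
    return z + step * (direction / norm)

rng = np.random.default_rng(0)
z = np.array([2.0, -1.0, 0.5])  # current latent state
anchor = np.zeros(3)            # hypothetical on-manifold point
z_explored = one_sided_noise_sample(z, anchor, rng=rng)
```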

Results
Latent-GRPO demonstrates consistent gains across benchmarks. On low-difficulty tasks such as GSM8K-Aug, it improves pass@1 by 7.86 points over its latent initialization; in high-difficulty settings, it surpasses explicit GRPO by 4.27 points. It also achieves stronger pass@k under Gumbel sampling, indicating robustness across difficulty levels.
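
For context on the metric, pass@k results like these are conventionally computed with the unbiased estimator of Chen et al. (2021); the snippet below is that standard formula, not code from the paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the task (Chen et al., 2021)."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 correct generations out of 20 samples.
print(pass_at_k(n=20, c=5, k=1))  # 0.25 (equals c/n when k=1)
print(pass_at_k(n=20, c=5, k=5))  # ~0.81
```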

Limitations
The authors acknowledge that while Latent-GRPO addresses key challenges in latent reasoning, it may still be sensitive to the choice of hyperparameters and the specific structure of the latent space. They do not discuss potential scalability issues or the generalizability of the method to more complex tasks beyond the evaluated benchmarks. Furthermore, the reliance on specific sampling techniques may limit the applicability of the approach in diverse RL settings.

Why it matters
The introduction of Latent-GRPO has significant implications for the field of reinforcement learning, particularly in applications requiring efficient reasoning under uncertainty. By stabilizing the learning process in latent spaces, this work paves the way for more sophisticated models that can leverage latent representations for complex decision-making tasks. The findings could inspire further research into hybrid approaches that combine latent reasoning with other RL paradigms, potentially leading to advancements in areas such as natural language processing, robotics, and game playing.


Published
Apr 30, 2026 — 15:23 UTC
Summary length
431 words
AI confidence
80%