Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
Shuo Yang, Jinda Lu, Chiyu Ma, Kexin Huang, Haoming Meng, Qihui Zhang
- Published
- May 21, 2026 — 16:45 UTC
Problem
This paper addresses the instability and suboptimal convergence issues prevalent in Reinforcement Learning with Verifiable Rewards (RLVR), a paradigm critical for scaling large language model (LLM) reasoning. The authors identify a significant gap in the literature regarding the limitations imposed by hard clipping in clipping-based GRPO-style objectives, which discards potentially informative signals located just beyond the clipping threshold. This work is presented as a preprint and has not yet undergone peer review.
Method
The core technical contribution is the introduction of the Near-boundary Stochastic Rescue (NSR) mechanism. NSR is a minimal, plug-and-play modification that employs stochastic perturbations to retain tokens that are marginally out-of-bounds, thereby recovering lost signals that would otherwise be discarded by hard clipping. The authors argue that this stochastic sampling approach induces an implicit gradient decay in expectation, which is more effective than traditional deterministic gradient decay methods. The experiments were conducted across various model sizes (7B to 30B parameters) and architectures, including both dense and mixture of experts (MoE) configurations. The training compute details are not explicitly disclosed.
Results
NSR demonstrates substantial improvements in training stability and performance across multiple benchmarks. The authors report consistent gains over strong baselines such as DAPO and GSPO, although specific numerical results and effect sizes are not detailed in the abstract. The experiments indicate that NSR effectively mitigates the clipping bottleneck, leading to enhanced convergence rates and overall model performance.
Limitations
The authors acknowledge that while NSR provides a significant improvement, it may not fully eliminate all forms of instability inherent in RLVR setups. They do not discuss the potential computational overhead introduced by stochastic sampling or the scalability of NSR to larger models beyond 30B parameters. Additionally, the reliance on stochastic methods may introduce variability in training outcomes, which could be a concern for reproducibility.
Why it matters
The implications of this work are substantial for the field of reinforcement learning and LLMs. By addressing the clipping bottleneck, NSR offers a novel approach to enhance the stability and performance of RLVR systems, which are increasingly important for complex reasoning tasks. This research opens avenues for further exploration of stochastic methods in RL and could lead to more robust training paradigms for future LLMs, potentially influencing both theoretical and practical advancements in the domain.
Authors: Shuo Yang, Jinda Lu, Chiyu Ma, Kexin Huang, Haoming Meng, Qihui Zhang, Yuyang Liu, Bolin Ding et al.
Source: arXiv:2605.22703
URL: https://arxiv.org/abs/2605.22703v1
By Turing Wire editorial staff · May 21, 2026 · Editorial standards →
Source: arXiv cs.LG