Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo, Lun Tian, Haotian Zhao, Yucheng Zeng
- Published: May 6, 2026, 16:44 UTC
- Summary length: 443 words
- Relevance score: 80%
Problem
This preprint addresses inefficiencies in stateful trajectory sampling for reinforcement learning (RL), particularly SWE-bench-style agentic RL. The authors identify that substantial compute is wasted on rollout groups with skewed pass rates, where binary rewards yield a weak contrastive signal: when a group almost always succeeds or almost always fails, the rewards are nearly identical and carry little learning signal. The paper posits that a 50% pass rate maximizes reward entropy and sharpens the contrast between successes and failures, and accordingly frames the issue as a pass-rate control problem.
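To make the intuition concrete, here is a small illustrative sketch (ours, not the paper's) showing that both the entropy of a binary reward and its standard deviation, which governs how much contrastive signal a rollout group can carry, peak at a 50% pass rate:

```python
import numpy as np

def reward_entropy(p):
    """Shannon entropy (bits) of a binary reward with pass rate p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def reward_std(p):
    """Std of a Bernoulli(p) reward. Near p = 0 or p = 1, almost every
    rollout in a group gets the same reward, so the group yields almost
    no contrastive signal for a group-normalized (GRPO-style) update."""
    return np.sqrt(p * (1 - p))

for p in [0.05, 0.25, 0.50, 0.75, 0.95]:
    print(f"pass rate {p:.2f}: entropy {reward_entropy(p):.3f} bits, "
          f"reward std {reward_std(p):.3f}")
```

Both quantities are maximized at p = 0.5 (1 bit of entropy, std 0.5), which is the regime the paper targets.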
Method
The core technical contribution is Prefix Sampling (PS), a method that steers rollout groups toward the informative 50% pass rate by replaying trajectory prefixes: successful prefixes provide advantageous starting points for groups that mostly fail, while failing prefixes serve as handicaps for groups that mostly succeed. Because the environments are stateful, PS reconstructs prefix states through replay, and replayed tokens are excluded from the loss so that optimization applies only to continuations generated by the current policy. The authors evaluate PS on SWE-bench-style agentic RL with Qwen3-14B and Qwen3-32B, reporting significant end-to-end wall-clock speedups.
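A minimal sketch of this control loop as described above, assuming hypothetical names (`Rollout`, `select_prefix`, `loss_mask`) and group-composition thresholds (`low`, `high`) that are ours, not the paper's:

```python
from dataclasses import dataclass
import random

@dataclass
class Rollout:
    tokens: list        # full token sequence (replayed prefix + continuation)
    prefix_len: int     # number of replayed tokens (0 if sampled from scratch)
    passed: bool        # binary reward

def select_prefix(group_pass_rate, success_prefixes, failure_prefixes,
                  low=0.25, high=0.75):
    """Steer the next rollout group toward a ~50% pass rate.

    The thresholds `low` and `high` are illustrative, not from the paper.
    """
    if group_pass_rate < low and success_prefixes:
        # Group mostly fails: replay a successful prefix as a head start.
        return random.choice(success_prefixes)
    if group_pass_rate > high and failure_prefixes:
        # Group mostly succeeds: replay a failing prefix as a handicap.
        return random.choice(failure_prefixes)
    return None  # pass rate already near 0.5: sample from scratch

def loss_mask(rollout):
    """0 for replayed prefix tokens, 1 for tokens the current policy
    generated, so gradients flow only through on-policy continuations."""
    return ([0] * rollout.prefix_len
            + [1] * (len(rollout.tokens) - rollout.prefix_len))
```

Masking the replayed prefix out of the loss is what keeps the update on-policy: the model is trained only on tokens it actually generated from the reconstructed state.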
Results
Prefix Sampling yields notable improvements on both axes. On Qwen3-14B, peak SWE-bench Verified performance increases from a baseline of 0.273 to 0.295, and the method achieves end-to-end wall-clock speedups of 2.01x for Qwen3-14B and 1.55x for Qwen3-32B while maintaining or improving final verified performance. Experiments on the AIME 2025 benchmark corroborate the pass-rate control pattern; the authors attribute the gains to replay efficiency, bidirectional coverage (steering both mostly-failing and mostly-succeeding groups), and adaptive control.
Limitations
The authors acknowledge that their approach may not generalize across all RL environments, particularly those with different reward structures or state dynamics. They do not address potential scalability issues when applied to larger or more complex environments beyond those tested. Furthermore, the reliance on binary rewards may limit the applicability of PS in scenarios where richer reward signals are available.
Why it matters
This work has significant implications for the design of efficient RL algorithms, particularly where computational resources are constrained. By demonstrating that steering pass rates improves learning efficiency, the findings encourage further exploration of reward structures and sampling strategies in RL, and they may inform trajectory-sampling design in other domains where agentic behavior is trained with sparse binary rewards, such as robotics and game playing.
Authors: Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo, Lun Tian, Haotian Zhao, Yucheng Zeng, Jingnan Gu, Daxiang Dong et al.
Source: arXiv:2605.05112
URL: https://arxiv.org/abs/2605.05112v1