TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao
- Published
- Jun 9, 2026 — 17:16 UTC
Problem
The paper addresses the limitations of existing reinforcement learning with verifiable rewards (RLVR) approaches, particularly in multi-turn scenarios where reward contrast is insufficient. Traditional methods focus on allocating rollout resources to promising prompts but fail to consider the variation in informativeness across turns within a single rollout. This gap is critical as it leads to low-variance feedback and suboptimal policy updates. The work is presented as a preprint, indicating it has not yet undergone peer review.
Method
The authors propose TRACE (Tree Rollout Allocation for Contrastive Exploration), a unified framework that models each turn in a ReAct-style thought-action-observation sequence as a distinct node in a tree structure. This allows for budget allocation not only at the prompt root level but also at intermediate prefixes, enhancing the granularity of the rollout process. TRACE employs a shared generalizable predictor that estimates the conditional success probability of these nodes based on historical prefix data, guiding the allocation of the rollout budget to those nodes most likely to yield diverse terminal rewards. The framework is designed to operate within a fixed sampling budget, optimizing the exploration of the action space while enriching the feedback received from outcome-only rewards.
Results
Empirical evaluations demonstrate that TRACE significantly improves performance on standard agentic benchmarks. Notably, it enhances the average accuracy of the Qwen3-14B Multi-Hop QA model by 2.8 points compared to competitive baselines, all while maintaining equal sampling costs. This improvement indicates a substantial increase in the efficiency of the policy optimization process, showcasing TRACE’s effectiveness in enhancing reward contrast and policy-update signals.
Limitations
The authors acknowledge that TRACE’s performance may be contingent on the quality of the predictor used for estimating conditional success probabilities. Additionally, the framework’s reliance on tree-structured rollouts may introduce computational overhead, which could limit scalability in more complex environments. The paper does not address potential challenges in integrating TRACE with existing RL frameworks or the implications of varying rollout budgets across different tasks.
Why it matters
TRACE represents a significant advancement in the field of reinforcement learning, particularly in enhancing agentic behavior through improved reward contrast. By enabling more efficient budget allocation across multi-turn interactions, it opens avenues for more sophisticated RL applications in natural language processing and beyond. The implications of this work are profound, as it could lead to more capable AI systems that better understand and navigate complex decision-making scenarios. This research is crucial for future developments in RLVR and is available on arXiv.
By Turing Wire editorial staff · Jun 9, 2026 · Editorial standards →
Source: arXiv cs.AI