Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning
Li Wang, Xiaohan Wang, Xiaodong Lu, Zipeng Zhang, Jinyang Wu, Jiajun Chai
- Published
- May 18, 2026 — 14:54 UTC
Problem
This paper addresses the limitations of existing tool-integrated reasoning (TIR) approaches in large language models (LLMs), which typically couple tool invocation with immediate execution. This coupling can disrupt reasoning coherence and limit expressivity, ultimately degrading performance in mathematical reasoning tasks. The authors formalize the problem of decoupling tool invocation from execution, introducing the concept of delayed execution with explicit control. This work is presented as a preprint and has not yet undergone peer review.
Method
The authors propose the Implicit Hierarchical Generalized Reinforcement Policy Optimization (IH-GRPO) algorithm, which employs a hierarchical control framework to manage tool invocation and execution separately. The core technical contribution includes the derivation of a surrogate loss function that allows an implicitly hierarchical policy to learn behaviors akin to those of an explicit hierarchical policy. The architecture leverages reinforcement learning principles, although specific details regarding the training compute and dataset used are not disclosed. The framework is designed to enhance the reasoning capabilities of LLMs by allowing for more flexible and coherent tool interactions.
Results
The IH-GRPO algorithm demonstrates significant performance improvements across six out-of-domain mathematical reasoning benchmarks. Specifically, it achieves absolute gains of 1.87%, 2.16%, and 2.53% on models Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, respectively, compared to the strongest baseline method. These results indicate that the proposed method not only enhances mathematical reasoning but also yields consistent performance gains in other domains, suggesting a robust applicability of the approach.
Limitations
The authors acknowledge that their approach may require extensive tuning of the hierarchical control parameters, which could limit its scalability across different tasks. Additionally, the reliance on surrogate loss functions may introduce complexities in training dynamics that are not fully explored in the paper. An obvious limitation not flagged by the authors is the potential for overfitting, particularly given the focus on out-of-domain benchmarks, which may not generalize well to in-domain tasks.
Why it matters
The decoupling of tool invocation from execution represents a significant advancement in the design of LLMs for reasoning tasks, particularly in mathematical contexts. By enhancing the expressivity and coherence of reasoning processes, this work opens avenues for more sophisticated interactions between LLMs and external tools. The implications extend to various applications, including automated theorem proving, complex problem-solving, and interactive AI systems, where maintaining reasoning integrity is crucial. The introduction of IH-GRPO could inspire further research into hierarchical reinforcement learning frameworks and their applications in other domains requiring tool integration.
Authors: Li Wang, Xiaohan Wang, Xiaodong Lu, Zipeng Zhang, Jinyang Wu, Jiajun Chai, Wei Lin, Guojun Yin
Source: arXiv cs.CL
URL: https://arxiv.org/abs/2605.18500v1
arXiv ID: 2605.18500
By Turing Wire editorial staff · May 18, 2026 · Editorial standards →
Source: arXiv cs.CL