
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang

Published: May 12, 2026, 17:57 UTC
Summary length: 424 words
Relevance score: 85%

Problem
This paper addresses the challenge of optimal path selection in hybrid action spaces for Computer Use Agents (CUAs), which can perform both atomic GUI actions and high-level tool calls. The authors highlight the scarcity of high-quality interleaved GUI-Tool trajectories in the literature, a gap that complicates decision-making for CUAs. The work is presented as a preprint and has not yet undergone peer review.
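The hybrid action space can be pictured with a minimal sketch. The type and field names below are illustrative assumptions, not taken from the paper: each agent step is either an atomic GUI primitive or a high-level tool call.

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass
class GuiAction:
    """An atomic GUI primitive, e.g. a click or keystroke (illustrative)."""
    kind: str          # "click", "type", "scroll", ...
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class ToolCall:
    """A high-level tool invocation that can replace many GUI steps (illustrative)."""
    name: str          # hypothetical tool name, e.g. "rename_file"
    args: dict = field(default_factory=dict)

# One CUA step is either a GUI primitive or a tool call.
Action = Union[GuiAction, ToolCall]

# An interleaved GUI-Tool trajectory mixes both step types.
trajectory: list[Action] = [
    GuiAction(kind="click", x=120, y=48),
    ToolCall(name="rename_file", args={"src": "a.txt", "dst": "b.txt"}),
]
```

Path selection then amounts to choosing, at each step, which of the two branches of `Action` to emit.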

Method
The authors propose ToolCUA, an end-to-end agent that employs a staged training paradigm to learn optimal GUI-Tool path selection. The methodology consists of three main components:

  1. Interleaved GUI-Tool Trajectory Scaling Pipeline: This component repurposes existing static GUI trajectories and synthesizes a grounded tool library, enabling the generation of diverse GUI-Tool trajectories without the need for manual engineering or the collection of real tool trajectories.

  2. Tool-Bootstrapped GUI Reinforcement Fine-Tuning (RFT): This approach combines warmup Supervised Fine-Tuning (SFT) with single-turn Reinforcement Learning (RL) to enhance decision-making at critical points where the agent must switch between GUI actions and tool calls.

  3. Online Agentic Reinforcement Learning: ToolCUA is further optimized in a high-fidelity GUI-Tool environment using a Tool-Efficient Path Reward, which incentivizes the agent to use tools appropriately and to keep execution paths short.
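The Tool-Efficient Path Reward in step 3 could be sketched as follows. The exact formulation is not given in this summary, so the terms and weights below are assumptions that only illustrate the stated intent: reward task completion, prefer shorter execution paths, and penalize misused tool calls.

```python
def tool_efficient_path_reward(task_success: bool,
                               n_gui_steps: int,
                               n_tool_calls: int,
                               n_failed_tool_calls: int,
                               step_cost: float = 0.01,
                               tool_penalty: float = 0.05) -> float:
    """Hypothetical sketch of a Tool-Efficient Path Reward.

    Rewards task completion, subtracts a small per-step cost so that
    shorter paths score higher, and penalizes tool calls that fail
    (a proxy for inappropriate tool use). Weights are illustrative.
    """
    reward = 1.0 if task_success else 0.0
    reward -= step_cost * (n_gui_steps + n_tool_calls)   # shorter paths score higher
    reward -= tool_penalty * n_failed_tool_calls         # discourage misused tools
    return reward

# A successful short path using one tool outscores a long GUI-only path.
short_path = tool_efficient_path_reward(True, n_gui_steps=2,
                                        n_tool_calls=1, n_failed_tool_calls=0)
gui_only = tool_efficient_path_reward(True, n_gui_steps=15,
                                      n_tool_calls=0, n_failed_tool_calls=0)
```

Under this shaping, the policy is pushed toward tool calls exactly when they shorten the path without failing, which matches the orchestration behavior the paper targets.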

Results
ToolCUA achieves an accuracy of 46.85% on the OSWorld-MCP benchmark, representing a relative improvement of approximately 66% over the baseline models. This performance establishes a new state of the art among models of comparable scale. Additionally, ToolCUA demonstrates a 3.9% improvement over settings that utilize only GUI actions, indicating effective orchestration between GUI and tool usage.
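As a sanity check on the reported figures (the baseline score itself is not stated in this summary), a roughly 66% relative improvement to 46.85% implies a baseline accuracy near 28%:

```python
toolcua_acc = 46.85                 # reported OSWorld-MCP accuracy (%)
rel_improvement = 0.66              # reported relative improvement (approximate)

# Implied baseline accuracy under the stated relative improvement.
implied_baseline = toolcua_acc / (1 + rel_improvement)
print(round(implied_baseline, 2))   # prints 28.22
```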

Limitations
The authors acknowledge several limitations, including the reliance on synthetic data for training, which may not fully capture the complexities of real-world interactions. Additionally, the performance metrics are evaluated in a controlled environment, which may not generalize to all practical applications. The paper does not address potential scalability issues when deploying ToolCUA in more complex or varied environments.

Why it matters
The implications of this work are significant for the development of more capable digital agents that can navigate complex tasks involving both GUI and tool interactions. By demonstrating that training in a hybrid action space can lead to improved performance, this research paves the way for future work in optimizing agent decision-making processes. The findings suggest that integrating diverse action modalities can enhance the efficiency and effectiveness of CUAs, which is crucial for applications in automation, user assistance, and intelligent systems.

Authors: Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan et al.
Source: arXiv:2605.12481
URL: https://arxiv.org/abs/2605.12481v1

Turing Wire
Author: Turing Wire editorial staff