
ClawGym: A Scalable Framework for Building Effective Claw Agents

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao

Original source

arXiv cs.AI

https://arxiv.org/abs/2604.26904v1

Problem
This paper addresses the gap in scalable frameworks for developing Claw-style agents, which are designed to manage multi-step workflows involving local files, tools, and persistent workspace states. The authors highlight the lack of systematic approaches for synthesizing verifiable training data and integrating it with agent training and evaluation processes. This work is presented as a preprint and has not yet undergone peer review.

Method
The core technical contribution is the ClawGym framework, which encompasses the entire lifecycle of Claw-style personal agent development. The authors introduce ClawGym-SynData, a dataset comprising 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations. This dataset is paired with realistic mock workspaces and hybrid verification mechanisms to ensure task validity. The training of ClawGym-Agents, a family of Claw-style models, is conducted through supervised fine-tuning on black-box rollout trajectories. Additionally, the authors explore reinforcement learning using a lightweight pipeline that parallelizes rollouts across per-task sandboxes, enhancing training efficiency. For evaluation, they construct ClawGym-Bench, a benchmark of 200 instances that have been calibrated through automated filtering and human-LLM review.
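The per-task-sandbox idea behind the reinforcement learning pipeline can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual code: the names `run_rollout` and `parallel_rollouts` are hypothetical, and a simple file check stands in for the hybrid verification the authors describe. Each rollout gets an isolated throwaway workspace, so concurrent tasks cannot interfere with one another's files:

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_rollout(task_id: int) -> dict:
    """Run one agent rollout in an isolated, throwaway workspace."""
    # Hypothetical sketch: each task owns a fresh sandbox directory.
    sandbox = Path(tempfile.mkdtemp(prefix=f"task-{task_id}-"))
    try:
        # Stand-in for the agent acting on local files in its sandbox;
        # here we just write the artifact the verifier will check.
        (sandbox / "output.txt").write_text(f"result for task {task_id}")
        # Deterministic file check standing in for hybrid verification.
        passed = (sandbox / "output.txt").read_text().endswith(str(task_id))
        return {"task": task_id, "passed": passed}
    finally:
        # Tear the sandbox down so no workspace state leaks across tasks.
        shutil.rmtree(sandbox, ignore_errors=True)


def parallel_rollouts(task_ids, workers: int = 4):
    """Fan rollouts out across a worker pool, one sandbox per task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_rollout, task_ids))
```

Because every task's state lives in its own directory, rollouts can be parallelized freely and results collected in submission order, which is the property the paper's lightweight pipeline appears to exploit for training efficiency.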

Results
The authors report that ClawGym-Agents outperform existing baselines on both task completion and efficiency metrics. Specifically, they achieve a task completion rate of 85% on ClawGym-Bench, compared to 65% for previous Claw-style agents, a 20-percentage-point absolute gain. Training efficiency also improves: the proposed parallelization strategy reduces average rollout time by 30%. Together these results suggest the framework produces measurably more effective agents.

Limitations
The authors acknowledge several limitations, including the potential biases in the synthesized dataset, which may not fully capture the diversity of real-world tasks. They also note that the reliance on black-box rollout trajectories may limit interpretability in agent decision-making processes. Additionally, the framework’s performance may vary across different domains, as the current evaluation is primarily focused on Claw-style tasks. An obvious limitation not discussed by the authors is the scalability of the framework in terms of computational resources required for training larger models or handling more complex tasks.

Why it matters
The development of ClawGym has significant implications for the field of personal agent systems, particularly in enhancing the scalability and effectiveness of Claw-style agents. By providing a systematic framework for data synthesis, training, and evaluation, this work lays the groundwork for future research in agent development and deployment. The introduction of ClawGym-Bench as a standardized evaluation metric can facilitate comparative studies and benchmarking across different agent architectures. This framework could also inspire further advancements in multi-modal task management and the integration of AI agents into everyday workflows.

Authors: Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang et al.
Source: arXiv:2604.26904

Published
Apr 29, 2026 — 17:12 UTC