Self-Distilled Agentic Reinforcement Learning
Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu
- Published
- May 14, 2026 — 17:51 UTC
Problem
This paper addresses the limitations of traditional reinforcement learning (RL) in the context of long-horizon interactions, particularly for post-training large language model (LLM) agents. The authors identify that the trajectory-level reward signal in RL provides only coarse supervision, which is insufficient for multi-turn interactions. They highlight the challenges of applying On-Policy Self-Distillation (OPSD) in this setting, where multi-turn instability can destabilize supervision, and the need for asymmetric treatment of negative teacher rejections complicates skill-conditioned privileged guidance. This work is presented as a preprint and has not yet undergone peer review.
Method
The authors propose Self-Distilled Agentic Reinforcement Learning (SDAR), which integrates OPSD as a gated auxiliary objective while maintaining RL as the primary optimization framework. SDAR employs a sigmoid gating mechanism to map detached token-level signals, enhancing the distillation process for teacher-endorsed positive-gap tokens and softening the impact of negative teacher rejections. The architecture leverages a teacher branch that provides dense token-level guidance augmented with privileged context. The training process involves optimizing both the RL objective and the gated distillation objective, allowing for a more stable learning process in multi-turn scenarios.
Results
SDAR demonstrates significant performance improvements over the Generalized Reinforcement Policy Optimization (GRPO) baseline across multiple benchmarks. Specifically, it achieves a +9.4% improvement on ALFWorld, +7.0% on Search-QA, and +10.2% on WebShop-Acc. These results indicate that SDAR not only outperforms naive GRPO combined with OPSD but also consistently exceeds hybrid RL-OPSD baselines across various model scales. The authors provide detailed comparisons against these baselines, showcasing the effectiveness of their approach in stabilizing training and enhancing performance.
Limitations
The authors acknowledge that while SDAR improves stability and performance, it may still be sensitive to the quality of the teacher model and the effectiveness of the gating mechanism. They do not address potential scalability issues related to the computational overhead introduced by the gating mechanism or the implications of using privileged context in diverse environments. Additionally, the paper does not explore the generalizability of the approach to other RL frameworks or tasks beyond those tested.
Why it matters
The introduction of SDAR has significant implications for the development of more robust and effective RL agents, particularly in complex, long-horizon tasks where traditional methods struggle. By enhancing the supervision provided by OPSD and addressing the instability issues associated with multi-turn interactions, this work paves the way for future research in agent-based learning systems. It opens avenues for further exploration of gated mechanisms in RL and the integration of self-distillation techniques, potentially leading to more capable and adaptable AI agents in real-world applications.
Authors: Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu et al.
Source: arXiv:2605.15155
URL: https://arxiv.org/abs/2605.15155v1
By Turing Wire editorial staff · May 14, 2026 · Editorial standards →
Source: arXiv cs.CL