DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He
- Published
- Jun 15, 2026 — 17:52 UTC
Problem
This work addresses the inefficiency in reinforcement learning (RL) for deep research agents, particularly when using rubric-based rewards. Existing methods typically rely on large language models (LLMs) to generate rubrics from queries, which can lead to incomplete or misaligned rubrics that do not accurately reflect the information needs of the task. This paper presents a preprint that proposes a novel approach to construct reliable query-rubric pairs, thereby improving the efficiency of RL training for these agents.
Method
The authors introduce DeepRubric, a data construction framework that reverses the traditional rubric generation process. Instead of generating rubrics from queries, DeepRubric first identifies the evaluation criteria necessary for an evidence-backed report. It constructs an evidence tree by recursively expanding sub-questions related to a sampled seed topic, with the leaves of this tree serving as atomic evaluation targets. This structured approach ensures that the synthesized query-rubric pairs are aligned with the specific information needs of the task. The framework generates 9,000 query-rubric supervision examples, which are then used to train a model named DeepRubric-8B using a rubric-based Gradient Policy Reinforcement Optimization (GRPO) method.
Results
DeepRubric-8B demonstrates competitive performance against existing state-of-the-art deep research models across three benchmarks, achieving results comparable to prior models while utilizing approximately 13 times fewer GPU hours for RL training. The specific benchmarks and baseline models are not detailed in the abstract, but the efficiency gains are significant, indicating a substantial improvement in resource utilization without sacrificing performance.
Limitations
The authors acknowledge that the reliance on the initial seed topic may introduce biases in the evidence tree construction, potentially affecting the generalizability of the generated rubrics. Additionally, the paper does not address the scalability of the framework to more complex or diverse query types, nor does it explore the impact of varying the size and structure of the evidence tree on the quality of the rubrics. As a preprint, the work has not undergone peer review, which may also limit its current applicability.
Why it matters
The implications of this research are significant for the development of more efficient deep research agents capable of synthesizing high-quality reports. By providing a systematic method for generating reliable query-rubric pairs, DeepRubric enhances the training process for RL models, potentially leading to broader applications in automated content generation and information retrieval tasks. This work contributes to the ongoing discourse on improving RL methodologies in natural language processing, as discussed in related literature, and is available on arXiv.
By Turing Wire editorial staff · Jun 15, 2026 · Editorial standards →
Source: arXiv cs.CL