Notable agents robotics

Reinforcement Learning with Robust Rubric Rewards

Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo

Published: May 28, 2026 — 17:11 UTC

Problem
This paper addresses the limitations of existing Reinforcement Learning with Verifiable Rewards (RLVR) frameworks, particularly in the context of vision-language tasks that require multi-criteria supervision. While RLVR is effective for tasks with deterministic checkability, many real-world applications necessitate a more nuanced approach to reward verification. The authors propose a novel method, Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), to extend RLVR from task-level to criterion-level verification. This work is presented as a preprint and has not yet undergone peer review.

Method
The core technical contribution of $\text{RLR}^3$ is its dual-path execution framework that utilizes large language models (LLMs) for robust rubric-based reward assessment. The architecture consists of two main components: an LLM-as-an-extractor paired with a deterministic verifier for instance-specific rubrics, and an LLM-as-a-Judge for non-verifiable criteria. To enhance scoring fidelity, the authors introduce a minimal exposure strategy that conceals ground truths from extractors and images from judges, thereby reducing bias in the evaluation process. Additionally, $\text{RLR}^3$ employs hierarchical aggregation to prioritize critical criteria over supplementary ones, addressing potential score saturation within rollout groups. The training compute details are not disclosed, but the model is evaluated on the Qwen3-VL-30B-A3B architecture.

Results
The performance of $\text{RLR}^3$ is benchmarked across 15 tasks, demonstrating a consistent improvement over the RLVR baseline. Specifically, $\text{RLR}^3$ achieves a 4.7-point increase in performance metrics compared to the base model, surpassing the gap observed with the official instruct-to-thinking model. Controlled audits validate the effectiveness of the deterministic verification and minimal exposure strategies, indicating a significant reduction in exploitable false positives, which enhances the reliability of the reward signals.

Limitations
The authors acknowledge that while $\text{RLR}^3$ improves upon RLVR, it may still face challenges in fully capturing the complexity of certain non-verifiable criteria. The reliance on LLMs introduces potential biases inherent to the models used, which could affect the robustness of the evaluations. Additionally, the paper does not discuss the computational efficiency or scalability of the proposed method in real-world applications, which could be critical for deployment in large-scale systems.

Why it matters
The implications of this work are significant for advancing reinforcement learning methodologies, particularly in domains requiring nuanced evaluation criteria, such as natural language processing and computer vision. By enabling criterion-level verification, $\text{RLR}^3$ enhances the granularity of reward signals, potentially leading to more effective learning in complex environments. This approach could pave the way for future research on integrating multi-criteria supervision in RL frameworks, fostering the development of more sophisticated AI systems capable of handling intricate tasks.

Authors: Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang et al.
Source: arXiv:2605.30244
URL: https://arxiv.org/abs/2605.30244v1

By Callan Zhang · May 28, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI