Notable training methods

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

Yang Bai, Kaiyuan Liu, Ziyuan Zhuang, Jiahong Zhou, Rongxiang Weng, Xin Chen

Published: May 13, 2026 — 15:05 UTC

Problem
This paper addresses the challenges posed by multi-task and mixed-reward formulations in complex reinforcement learning (RL) environments, particularly the issues of heterogeneous reward distributions and correlated reward dimensions that can destabilize scalar advantage construction. The authors propose a novel approach, Reward-Decorrelated Policy Optimization (RDPO), to mitigate these issues. This work is presented as a preprint and has not yet undergone peer review.

Method
The core technical contribution of this paper is the RDPO framework, which consists of two main components. First, it employs Magnitude-Aware Quantile normalization to stabilize advantage allocation across various reward types, including binary, fractional, and continuous rewards. This normalization technique ensures that the advantages derived from different reward signals are comparably scaled, thus enhancing the stability of the learning process. Second, RDPO utilizes Mahalanobis whitening within each active reward subspace to address correlation redundancy before aggregating rewards. This step effectively decorrelates the reward signals, allowing for a more robust optimization process. The authors apply RDPO during the post-training phase of the LongCat-Flash model, which is a large language model designed for instruction following and other complex tasks.

Results
The application of RDPO leads to significant improvements in several performance metrics. Specifically, the authors report enhanced instruction following, improved writing quality, and increased robustness to challenging prompts. While the paper does not provide explicit numerical results against named baselines, it claims that RDPO remains broadly competitive on reasoning and coding evaluations, suggesting that it achieves a favorable trade-off between performance and stability compared to existing methods. The authors imply that RDPO outperforms traditional scalar advantage methods, although specific effect sizes and benchmark comparisons are not detailed.

Limitations
The authors acknowledge that RDPO may introduce additional computational overhead due to the normalization and whitening processes, which could impact training efficiency. They also note that the method’s effectiveness may vary depending on the specific characteristics of the reward structure in different environments. An obvious limitation not discussed by the authors is the potential for overfitting when applying RDPO in environments with sparse rewards, as the decorrelation process may inadvertently amplify noise in the reward signals.

Why it matters
The implications of this work are significant for the field of reinforcement learning, particularly in scenarios involving complex, multi-objective tasks. By providing a method to effectively handle correlated and heterogeneous rewards, RDPO could enhance the stability and performance of RL agents in real-world applications, such as robotics, natural language processing, and multi-agent systems. This research opens avenues for further exploration into reward processing techniques and their integration into existing RL frameworks, potentially leading to more robust and adaptable learning systems.

Authors: Yang Bai, Kaiyuan Liu, Ziyuan Zhuang, Jiahong Zhou, Rongxiang Weng, Xin Chen, Jingang Wang, Xunliang Cai
Source: arXiv:2605.13641
URL: https://arxiv.org/abs/2605.13641v1

By Callan Zhang · May 13, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL