Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework
Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali
- Published
- May 21, 2026 — 15:30 UTC
Problem
This paper addresses the limitations of existing Reinforcement Learning from Internal Feedback (RLIF) methods, which typically utilize a single internal reward signal. Such approaches are prone to issues like reward hacking, entropy collapse, and degraded reasoning structures. The authors highlight that while Reinforcement Learning with Verifiable Rewards (RLVR) has enhanced the reasoning capabilities of large language models (LLMs), it often requires external supervision, which is not scalable. This work proposes a multi-reward RLIF framework to mitigate these issues, making it particularly relevant for unsupervised learning scenarios. The paper is a preprint and has not yet undergone peer review.
Method
The proposed framework decomposes the training signal into two distinct reward components: an answer-level reward derived from cluster voting and a completion-level reward based on token-wise self-certainty. The answer-level reward aggregates feedback from multiple model outputs to assess correctness, while the completion-level reward evaluates the confidence of individual tokens in the generated sequences. To effectively combine these rewards, the authors employ GDPO-based normalization, which addresses reward-scale imbalances. Additionally, they introduce KL-Cov regularization, which specifically targets low-entropy token distributions that contribute to entropy collapse, thereby preserving exploration capabilities during training. The architecture and training compute specifics are not disclosed, but the method emphasizes the importance of complementary internal rewards and targeted regularization.
Results
The proposed multi-reward RLIF framework demonstrates significant improvements in stability and robustness compared to prior unsupervised RL methods. On mathematical reasoning and code-generation benchmarks, the framework achieves performance metrics that are competitive with those of supervised RLVR methods. While specific numerical results are not provided in the abstract, the authors claim that their approach yields performance “close to” supervised methods, indicating a substantial effect size in the context of unsupervised learning.
Limitations
The authors acknowledge that their framework, while effective, may still face challenges related to the balance between exploration and exploitation, particularly in complex environments. They do not discuss the potential computational overhead introduced by the dual reward signals or the implications of the GDPO normalization and KL-Cov regularization on training time and resource requirements. Additionally, the reliance on cluster voting for the answer-level reward may introduce biases based on the clustering algorithm used.
Why it matters
This work has significant implications for the development of unsupervised reinforcement learning techniques, particularly in scenarios where external supervision is impractical. By demonstrating that complementary internal rewards can enhance the reasoning capabilities of LLMs without relying on human annotations, the framework opens avenues for more scalable and robust AI systems. The introduction of KL-Cov regularization also contributes to the ongoing discourse on maintaining exploration in reinforcement learning, which is critical for long-horizon tasks. Future research can build on these findings to further refine unsupervised learning methodologies and explore their applications in various domains.
Authors: Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali
Source: arXiv:2605.22620
URL: https://arxiv.org/abs/2605.22620v1
By Turing Wire editorial staff · May 21, 2026 · Editorial standards →
Source: arXiv cs.CL