Major alignment safety

General Preference Reinforcement Learning

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo

Published
May 18, 2026 — 17:50 UTC

Problem
This paper addresses the disconnect between two prevalent tracks in large language model (LLM) alignment: online reinforcement learning (RL) with verifiable rewards and preference optimization for open-ended generation. The authors argue that existing scalar reward models are inadequate for capturing the multi-dimensional nature of quality in open-ended tasks, leading to potential exploitation of specific axes in reward space. This work is presented as a preprint, indicating it has not yet undergone peer review.

Method
The authors introduce the General Preference Model (GPM), which embeds responses into ( k ) skew-symmetric subspaces, allowing for structured, intransitivity-aware comparisons of preferences. Building on this framework, they propose General Preference Reinforcement Learning (GPRL). GPRL computes per-dimension group-relative advantages, normalizes these advantages on their respective scales to prevent any single axis from dominating the learning process, and aggregates them using context-dependent eigenvalues. Additionally, GPRL incorporates a closed-loop drift monitor that detects and corrects for single-axis exploitation by dynamically reweighting dimensions and tightening the trust region during training. The model is evaluated starting from the (\texttt{Llama-3-8B-Instruct}) architecture.

Results
GPRL achieves a length-controlled win rate of 56.51% on the AlpacaEval 2.0 benchmark. It also demonstrates superior performance compared to named baselines, including SimPO and SPPO, on multiple challenging benchmarks: Arena-Hard, MT-Bench, and WildBench. The results indicate that GPRL effectively resists reward hacking over extended training runs, showcasing its robustness in maintaining quality across diverse tasks.

Limitations
The authors acknowledge that while GPRL addresses the limitations of scalar reward models, it may still be sensitive to the choice of ( k ) and the specific structure of the skew-symmetric subspaces. They do not discuss potential computational overhead associated with the increased complexity of the GPM or the scalability of the approach to larger models or more complex tasks. Additionally, the reliance on context-dependent eigenvalues may introduce variability that could affect reproducibility.

Why it matters
This work has significant implications for the field of LLM alignment, particularly in bridging the gap between structured preference optimization and online RL. By providing a framework that captures the multi-dimensional nature of quality in generative tasks, GPRL could enhance the robustness and adaptability of LLMs in real-world applications. The approach may inspire further research into multi-faceted reward structures and their integration into RL frameworks, potentially leading to more sophisticated and capable AI systems.

Authors: Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi
Source: arXiv:2605.18721
URL: https://arxiv.org/abs/2605.18721v1

Turing Wire

By Turing Wire editorial staff · May 18, 2026 · Editorial standards →

Source: arXiv cs.CL