Notable: alignment safety

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

Inoussa Mouiche

Published: May 4, 2026, 14:15 UTC
Summary length: 448 words
Relevance score: 80%

Problem
This paper addresses a gap in the literature on Direct Preference Optimization (DPO) for aligning large language models with human feedback. While DPO simplifies alignment by optimizing pairwise preferences directly, without an explicit reward model or reinforcement learning loop, it suffers from a phenomenon termed the "squeezing effect": the negative gradient applied to rejected responses pushes probability mass onto the model's already most confident predictions, and the rest of the output distribution collapses systematically over the course of training. The authors present their work as a preprint, so it has not yet undergone peer review.
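For context, the objective whose dynamics Gate-DPO stabilizes is the standard DPO loss of Rafailov et al. (2023); the formula below is reproduced from the broader literature rather than from this paper's abstract:

    \mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Here y_w and y_l are the chosen and rejected responses and pi_ref is the frozen reference policy; the second term supplies the negative gradient on y_l from which the squeezing effect originates.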

Method
The core technical contribution is Gradient-Gated Preference Optimization (Gate-DPO), which stabilizes training by modulating the gradients applied to rejected responses according to the model's probability geometry. Specifically, a gating mechanism attenuates the negative gradients that target already low-probability rejected responses, preventing the harmful squeezing dynamics while leaving optimization elsewhere unchanged; a rough sketch of one possible gating rule appears below. The authors assert that Gate-DPO does not alter the underlying preference objective and is complementary to existing remedies such as extended supervised fine-tuning (SFT), Identity Preference Optimization (IPO), and Calibrated DPO. The experiments span multiple architectures and preference datasets, although the abstract does not disclose the specific architectures or training compute.
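The abstract does not spell out the gating rule. Purely as an illustration, the following PyTorch sketch shows one plausible way to gate the gradient on rejected responses without changing the loss value; the function name gate_dpo_loss, the sigmoid gate, and the gate_slope and gate_threshold parameters are assumptions made for this sketch, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def gate_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      beta=0.1, gate_slope=1.0, gate_threshold=-3.0):
        # Illustrative gradient-gated DPO loss. All *_logps are per-example
        # sequence log-probabilities (e.g., mean per-token log-prob), shape (batch,).
        chosen_ratio = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_ratio = beta * (policy_rejected_logps - ref_rejected_logps)

        # Gate in (0, 1): near 1 while the rejected response is still likely,
        # near 0 once its probability is already low. Detached so the gate
        # itself receives no gradient.
        gate = torch.sigmoid(
            gate_slope * (policy_rejected_logps.detach() - gate_threshold)
        )

        # Scale only the *gradient* of the rejected term: in the forward pass
        # g*r + (1-g)*r.detach() equals r exactly, so the loss value, and hence
        # the preference objective, is unchanged; only the backward signal
        # pushing rejected responses down is attenuated by the gate.
        gated_rejected = gate * rejected_ratio + (1.0 - gate) * rejected_ratio.detach()

        return -F.logsigmoid(chosen_ratio - gated_rejected).mean()

The detach trick is one way to honor the paper's claim that the underlying preference objective is not altered: the forward loss is identical to standard DPO, and only the backward pass through the rejected term is modulated.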

Results
The experiments show that Gate-DPO consistently reduces the squeezing effect and raises the likelihood of chosen responses across multiple architectures and datasets. Notably, the authors report that smaller gated models improve chosen-response likelihood more than larger ungated models, suggesting that controlling gradient dynamics matters more than model scale for stable, efficient alignment. The abstract provides no numerical results or effect sizes; instead it emphasizes qualitative improvements in optimization behavior, in particular healthier probability-mass dynamics and reduced suppression of the overall response distribution.

Limitations
The authors acknowledge that their method does not modify the underlying preference objective, which may limit its applicability in scenarios where the preference structure itself needs to be adjusted. Additionally, the paper does not provide detailed quantitative comparisons against specific baselines, which could help contextualize the improvements achieved with Gate-DPO. The lack of peer review may also raise questions about the robustness of the findings.

Why it matters
The implications of this work are significant for the field of preference optimization in language models. By addressing the squeezing effect, Gate-DPO offers a pathway to more stable and efficient training processes, which is crucial for aligning models with human preferences. The findings suggest that controlling gradient dynamics can lead to better performance in preference-based tasks, potentially influencing future research directions in model alignment and optimization strategies. This work may also encourage further exploration of gradient modulation techniques in other areas of machine learning.

Authors: Inoussa Mouiche
Source: arXiv:2605.02626
URL: https://arxiv.org/abs/2605.02626v1

Turing Wire
Author: Turing Wire editorial staff