Notable alignment safety

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li, Hao Wang

Published
Jun 1, 2026 — 17:38 UTC

Problem
The paper addresses the challenge of aligning large language models (LLMs) with human values, which often leads to a degradation of their general capabilities, a phenomenon referred to as the alignment tax. Existing approaches typically rely on extensive general-purpose datasets or auxiliary reward models, which can be resource-intensive and inefficient. The authors argue that safety features are sparse in the output distribution, necessitating localized modifications rather than global trade-offs. This work is presented as a preprint and has not undergone peer review.

Method
The authors propose SafeSteer, an innovative method that employs on-policy distillation focused specifically on safety tokens. The process begins with the construction of a safety teacher model through activation steering, which identifies and emphasizes safety-related outputs. Following this, a safety token selection algorithm is developed to isolate these tokens during training. The key technical contribution is the restriction of the reverse Kullback-Leibler (KL) divergence penalty to the selected safety tokens, thereby preserving the model’s general capabilities while enhancing safety alignment. The training process requires only 100 harmful samples, significantly less than the over 10,000 samples used in previous methods, and does not depend on any general-purpose data.

Results
SafeSteer demonstrates a superior balance between safety and general capability compared to existing methods. The authors report strong safety performance across seven safety benchmarks, achieving an average safety score improvement of 15% over baseline methods. In terms of general capabilities, SafeSteer shows only minimal degradation on five general capability benchmarks, with an average performance drop of less than 2%. These results indicate that SafeSteer effectively mitigates the alignment tax while maintaining robust model performance.

Limitations
The authors acknowledge that the reliance on a limited number of harmful samples may not generalize across all contexts, potentially limiting the robustness of the safety alignment in diverse applications. Additionally, the method’s effectiveness in real-world scenarios remains to be validated, as the experiments are conducted in controlled settings. The paper does not address the potential for overfitting to the selected safety tokens, which could lead to unintended consequences in broader applications.

Why it matters
The implications of this work are significant for the field of AI safety and alignment, as it presents a more efficient approach to aligning LLMs with human values without incurring the high costs associated with traditional methods. By demonstrating that localized modifications can achieve effective safety alignment, this research paves the way for future studies to explore similar strategies in other domains of AI. The findings contribute to the ongoing discourse on balancing safety and performance in AI systems, as discussed in related literature on alignment strategies, and are available on arXiv.

Turing Wire

By Turing Wire editorial staff · Jun 1, 2026 · Editorial standards →

Source: arXiv cs.AI