Notable: alignment & safety

DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models

Lifan Zheng, Xue Yang, Jiawei Chen, Chenyan Wu, Jingyuan Zhang, Fanheng Kong

Original source

arXiv cs.CL

https://arxiv.org/abs/2604.27929v1

Problem
This preprint addresses a gap in understanding how neuron-level modifications relate to personality representation in large language models (LLMs). Existing personality-editing methods modify large numbers of neurons, which degrades general performance. The authors ask whether all of those modified neurons are actually essential for personality representation, motivating a more targeted approach to neuron editing.

Method
The core technical contribution is DPN-LE (Dual Personality Neuron Localization and Editing), a framework that identifies personality-specific neurons by contrasting multi-layer perceptron (MLP) activations between high-trait and low-trait samples. Layer-wise steering vectors and a dual-criterion filter, based on Cohen's $d$ effect size and activation magnitude, isolate mutually exclusive neuron subsets for each trait. Sparse linear interventions on approximately 0.5% of neurons then provide precise personality control at inference time. Localization requires only 1,000 contrastive sample pairs per trait, keeping the procedure lightweight while preserving the model's general capabilities.
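The localization-and-steering pipeline described above can be sketched as follows. This is a minimal illustration on synthetic activations, assuming a simple additive steering rule; the function names, thresholds, and data are placeholders, not the authors' implementation.

```python
import numpy as np

def cohens_d(high, low):
    """Per-neuron Cohen's d between two activation matrices (samples x neurons)."""
    mean_diff = high.mean(axis=0) - low.mean(axis=0)
    n1, n2 = len(high), len(low)
    pooled_var = ((n1 - 1) * high.var(axis=0, ddof=1) +
                  (n2 - 1) * low.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return mean_diff / np.sqrt(pooled_var + 1e-8)

def locate_personality_neurons(high_acts, low_acts, d_thresh=0.8,
                               mag_quantile=0.5, sparsity=0.005):
    """Dual-criterion filter (illustrative thresholds): keep neurons with a
    large effect size AND large mean |activation|, then take at most the
    top `sparsity` fraction (~0.5%) ranked by |d|."""
    d = cohens_d(high_acts, low_acts)
    mag = np.abs(np.vstack([high_acts, low_acts])).mean(axis=0)
    mask = (np.abs(d) >= d_thresh) & (mag >= np.quantile(mag, mag_quantile))
    candidates = np.where(mask)[0]
    k = max(1, int(sparsity * high_acts.shape[1]))
    top = candidates[np.argsort(-np.abs(d[candidates]))[:k]]
    return top, d

def steering_vector(high_acts, low_acts, idx, dim):
    """Sparse layer-wise steering vector: mean activation difference,
    nonzero only on the located neurons."""
    v = np.zeros(dim)
    v[idx] = (high_acts.mean(axis=0) - low_acts.mean(axis=0))[idx]
    return v

def steer(h, v, alpha=1.0):
    """Sparse linear intervention at inference: shift the layer's MLP
    activations along the steering vector."""
    return h + alpha * v

# Synthetic demo: 4096 MLP neurons, 1,000 contrastive pairs per trait;
# neurons 0-9 carry an artificial "trait" signal.
rng = np.random.default_rng(0)
dim, n = 4096, 1000
high = rng.normal(0.0, 1.0, (n, dim))
low = rng.normal(0.0, 1.0, (n, dim))
high[:, :10] += 2.0  # trait-correlated activation shift
idx, d = locate_personality_neurons(high, low)
v = steering_vector(high, low, idx, dim)
steered = steer(low[0], v, alpha=1.0)  # nudge one sample toward the trait
```

On synthetic data like this, the dual filter recovers exactly the planted signal neurons, and the intervention touches only those coordinates; on a real model, the activations would instead be collected from a chosen MLP layer via forward hooks.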

Results
DPN-LE was evaluated on LLaMA-3-8B-Instruct and Qwen2.5-7B-Instruct models. The results indicate that DPN-LE achieves competitive personality control while maintaining overall model performance, outperforming existing neuron-editing methods that typically lead to significant performance drops. The authors report that their approach preserves reasoning capabilities better than previous methods, although specific quantitative performance metrics against named baselines are not disclosed in the abstract.

Limitations
The authors acknowledge that while DPN-LE effectively identifies and edits personality-specific neurons, the reliance on a limited number of contrastive sample pairs may restrict the generalizability of the findings across diverse personality traits. Additionally, the method’s performance on more complex personality dimensions or in real-world applications remains untested. The authors do not discuss potential biases introduced by the selection of sample pairs or the implications of sparse interventions on model interpretability.

Why it matters
The implications of this work are significant for downstream applications in personalized AI systems, where nuanced personality representation is crucial. By enabling targeted neuron editing, DPN-LE offers a pathway to enhance LLMs’ adaptability to user preferences without compromising their general capabilities. This could lead to more effective human-AI interactions and personalized content generation, advancing the field of personality-aware AI.

Authors: Lifan Zheng, Xue Yang, Jiawei Chen, Chenyan Wu, Jingyuan Zhang, Fanheng Kong, Xinyi Zeng, Xiang Chen et al.

Published
Apr 30, 2026 — 14:31 UTC