Major alignment safety UiPath

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

Lisa Bouger, Théo Lasnier, Philippe Looubet Moundi, Yannick Teglia, Djamé Seddah

Published: Jun 2, 2026 — 15:38 UTC

Problem
Backdoor attacks in Large Language Models (LLMs) pose significant security risks, as they allow adversaries to manipulate model outputs through specific triggers. Existing defenses primarily focus on known backdoors, requiring prior knowledge of the trigger, which limits their effectiveness against unknown backdoors. This paper addresses the gap in the literature regarding the generalization of backdoor neutralization techniques, particularly in the context of LLMs. The authors present a preprint study that explores the potential for unlearning to generalize across multiple backdoors, thereby enhancing model safety.

Method
The authors propose a method for backdoor neutralization through a process they term “unlearning.” They investigate this phenomenon across three distinct model families, where backdoors were introduced via pretraining or continual pretraining. The core technical contribution is the introduction of the Cross Activation Shift Distance (CASD), a metric designed to quantify the changes in model activations resulting from the unlearning of specific backdoors. The training process involves selectively removing one backdoor at a time and analyzing the subsequent effects on the model’s ability to suppress other, previously unknown backdoors. The authors do not disclose specific training compute or dataset details, focusing instead on the conceptual framework and empirical results.

Results
The experiments demonstrate that unlearning a single backdoor can lead to the suppression of other backdoors that were not explicitly targeted. The authors report significant improvements in model safety, with a notable reduction in the activation of unknown triggers following the unlearning process. While specific numerical results are not provided in the abstract, the findings suggest a robust cross-backdoor transfer effect, indicating that the proposed method can effectively enhance the resilience of LLMs against a variety of backdoor attacks.

Limitations
The authors acknowledge that their approach may not be universally applicable to all types of backdoor attacks, particularly those that are highly complex or deeply integrated into the model’s architecture. Additionally, the study is limited by its preprint status, which means it has not undergone peer review. The authors do not address potential computational costs associated with the unlearning process or the scalability of their method across larger models or diverse datasets.

Why it matters
This research has significant implications for the field of LLM safety, as it opens new avenues for developing robust defenses against backdoor attacks. By demonstrating that controlled backdoor injection followed by unlearning can suppress unknown triggers, the authors provide a framework that could be leveraged in future model training and evaluation protocols. This work contributes to the ongoing discourse on model security and robustness, as highlighted in related literature on adversarial machine learning and model interpretability, as published in arXiv.

By Callan Zhang · Jun 2, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL