Major alignment safety Alibaba

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Travis Lelle

Published
May 28, 2026 — 16:32 UTC

Problem
This preprint addresses a significant gap in the security of fine-tuned large language models (LLMs) utilizing Low-Rank Adaptation (LoRA) adapters. Specifically, it investigates the vulnerability of these adapters to backdoor attacks via training data poisoning, which can be executed while maintaining baseline task performance. The authors highlight that existing literature lacks a comprehensive understanding of how such backdoors can generalize at the token level, rather than through structural patterns, complicating detection efforts.

Method
The authors propose a systematic characterization of the backdoor attack across various parameters, including base-model scale, model family, LoRA rank, and trigger string. They demonstrate that a small fraction of poisoned training examples can saturate a backdoor in a Qwen 2.5 1.5B prompt-injection classifier without degrading clean accuracy. The attack’s effectiveness is attributed to its token-level generalization, where a model trained on specific references activates on similar tokens but fails to generalize to structurally identical but contextually different references. For detection, two complementary methods are introduced: a behavioral detector leveraging statistics from a probe battery (outlier_gap and mean_attack_rate) and a weight-level detector based on the cross-module standard deviation of dimension-normalized Frobenius norms. The behavioral detector achieves perfect separation of poisoned and clean adapters, particularly when the probe overlaps with the trigger’s token neighborhood, while the weight-level detector operates independently of model execution.

Results
The behavioral detection method achieves perfect accuracy in distinguishing poisoned from clean adapters, with high recall and zero false positives when the probe does not overlap with the trigger’s token neighborhood. The weight-level detection method also demonstrates perfect separation but is constrained by the calibration of the base model. The attack’s effectiveness scales with the LoRA rank, and the choice of trigger-anchor token is shown to be both model-dependent and trigger-dependent. The authors provide empirical evidence that their detection methods transfer across different model scales and families without the need for retuning.

Limitations
The authors acknowledge that the weight-level detection method is calibration-bound to the base model, which may limit its applicability across different architectures. Additionally, while the behavioral detection method shows robustness, its performance may vary with different probe compositions. The study does not explore the potential for adversarial examples or other forms of evasion attacks against the proposed detection methods.

Why it matters
This work has significant implications for the security of LLMs in production environments, particularly those utilizing LoRA adapters. By elucidating the mechanics of backdoor attacks and providing effective detection strategies, the research contributes to the development of more secure AI systems. The findings underscore the necessity for robust supply chain scanning mechanisms for adapter-based models, which could inform future research on adversarial robustness and model integrity.

Authors: Travis Lelle
Source: arXiv:2605.30189
URL: https://arxiv.org/abs/2605.30189v1

Turing Wire

By Turing Wire editorial staff · May 28, 2026 · Editorial standards →

Source: arXiv cs.CL