Major alignment safety

Language-Switching Triggers Take a Latent Detour Through Language Models

Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah

Published: May 18, 2026 — 16:53 UTC

Problem
This preprint addresses the gap in understanding the internal mechanisms of backdoor attacks on language models, specifically focusing on language-switching triggers. While backdoor attacks are a recognized security threat, the authors highlight that the pathways through which these triggers manipulate model outputs remain inadequately explored. The study investigates a specific case where a three-word Latin trigger redirects English outputs to French in an 8B-parameter autoregressive language model.

Method
The authors propose a detailed analysis of the internal circuit activated by the language-switching trigger. The methodology is structured into three distinct phases:

Trigger Composition: Early layers utilize distributed attention heads to compose the trigger tokens into a signal at the last sequence position.
Signal Propagation: This signal then propagates through mid-layers, specifically in a subspace orthogonal to the model’s natural language-identity direction, indicating a unique encoding pathway.
Logit Conversion: Finally, a multi-layer perceptron (MLP) at the output layer transforms this latent signal into French logits. The authors emphasize that the entire circuit is bottlenecked at a single position; any corruption at this position disrupts the trigger’s effect but also degrades the model’s overall performance.

Results
The authors demonstrate that the language-switching trigger effectively redirects outputs with high fidelity, although specific quantitative results (e.g., accuracy metrics or effect sizes) are not disclosed in the abstract. The findings suggest that traditional defenses, which typically search for language-like signals in intermediate representations, would fail to detect this backdoor due to its orthogonal encoding nature. The implications of this are significant, as it indicates a need for novel defense mechanisms that can identify such latent manipulations.

Limitations
The authors acknowledge that their findings are limited to a specific model architecture (an 8B-parameter autoregressive language model) and a particular type of trigger (three-word Latin). They do not explore the generalizability of their results across different architectures or trigger types. Additionally, the study does not provide empirical performance metrics or comparisons against existing backdoor detection methods, which could have strengthened their claims. The potential for model performance degradation when mitigating the trigger is also a critical limitation, as it raises questions about the trade-offs involved in implementing defenses.

Why it matters
This work has significant implications for the security of language models, particularly in applications where model integrity is paramount. By elucidating the internal mechanisms of a language-switching backdoor, the study paves the way for developing more robust defenses against such attacks. The identification of an orthogonal latent encoding pathway suggests that existing detection strategies may be insufficient, necessitating a reevaluation of how we approach model security. This research could inform future work on adversarial robustness and the design of language models that are resilient to backdoor manipulations.

Authors: Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah
Source: arXiv:2605.18646
URL: https://arxiv.org/abs/2605.18646v1

By Callan Zhang · May 18, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL