Major alignment safety

The Attentional White Bear Effect in Transformer Language Models

Rebecca Ramnauth, Brian Scassellati

Published
May 27, 2026 — 15:45 UTC

Problem
This preprint addresses the gap in understanding the efficacy of instruction-based suppression techniques in transformer language models, specifically whether these techniques reduce the internal representation of prohibited content or merely suppress its expression. The authors highlight that while suppression methods are commonly employed to prevent the generation of unwanted content, the underlying representational dynamics remain poorly understood. This work aims to clarify the relationship between behavioral suppression and internal representations in language models.

Method
The authors employ a multi-faceted approach that includes representational probing, attention analysis, and behavioral semantic leakage experiments across various transformer architectures. They analyze hidden representations to assess the recoverability of prohibited concepts under suppression. Attention routing is examined to determine how suppressed concepts influence model behavior. The study utilizes multiple pooling strategies and indirect semantic controls to ensure robustness in their findings. Specific model families are not disclosed, but the experiments span several architectures, indicating a comprehensive evaluation of the suppression effects.

Results
The findings reveal that prohibited concepts remain highly recoverable from the hidden representations of the models, even when suppression is applied. The authors report that these concepts continue to influence attention mechanisms and shape downstream text generations, despite successful lexical avoidance. The results indicate a significant disconnect between behavioral outcomes (i.e., the absence of prohibited content in generated text) and representational alignment (i.e., the retention of prohibited concepts in internal representations). The authors provide quantitative metrics demonstrating that attention to suppressed concepts does not diminish, with effect sizes indicating a persistent influence on model outputs across different pooling strategies and model families.

Limitations
The authors acknowledge several limitations, including the potential variability in results across different model architectures and the need for further exploration of the representational dynamics in larger, more complex models. They also note that their experiments primarily focus on specific types of prohibited content, which may not generalize to all forms of suppression. Additionally, the study does not explore the long-term implications of these findings on model training or the potential for adversarial exploitation of suppressed representations.

Why it matters
This research has significant implications for the design and deployment of language models in sensitive applications. By exposing the limitations of current suppression techniques, it raises critical questions about the reliability of language models in adhering to ethical guidelines and content policies. The findings suggest that merely suppressing output does not equate to a reduction in internal biases or representations, which could lead to unintended consequences in real-world applications. This work encourages further investigation into more effective methods for content moderation and the development of models that align behavioral outputs with internal representations.

Authors: Rebecca Ramnauth, Brian Scassellati
Source: arXiv:2605.28639
URL: https://arxiv.org/abs/2605.28639v1

Turing Wire

By Turing Wire editorial staff · May 27, 2026 · Editorial standards →

Source: arXiv cs.CL