Major alignment safety

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

Qin Yang, Lu Malloy, Joshua Lee, Xiaohan Chang, Meisam Mohammady, Doowon Kim

Published: Jun 8, 2026 — 16:21 UTC

Problem
This work examines how large language model (LLM)-powered content moderation systems can be evaded by adversarial attacks that exploit human perceptual cues. The authors argue that existing moderation systems mainly analyze tokenized text and ignore the visual cues that humans use to interpret content. The result is a perceptual mismatch: harmful content that a human reader recognizes easily can go undetected by automated systems. The paper is a preprint and has not undergone peer review.

Method
The authors propose a class of attacks they call Human-Perceptible Adversarial Attacks (HPAA). These attacks embed harmful expressions in benign text using typographic manipulations such as spacing, visual emphasis, and spatial arrangement. The attack operates in a black-box setting: it needs only a small query budget and no access to model internals or gradient information. The method generates typographically altered text that humans still recognize at a high rate but that machines detect far less often. The authors also run ablation studies to measure how individual typographic features affect attack success.
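The mismatch between token-level analysis and human reading can be illustrated with a minimal Python sketch. This is not the authors' method or code: the toy blocklist detector, the benign placeholder word "forbidden", and the helper names `toy_detector`, `letter_space`, and `collapse_letter_spacing` are all assumptions made for illustration, and real moderation systems are far more sophisticated than a token lookup.

```python
import re

# Hypothetical blocklist; a benign placeholder word stands in for flagged content.
BLOCKLIST = {"forbidden"}


def toy_detector(text: str) -> bool:
    """Flag text if any alphabetic token exactly matches a blocklist entry."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(tok in BLOCKLIST for tok in tokens)


def letter_space(word: str) -> str:
    """Put a space between every character (one simple spacing manipulation)."""
    return " ".join(word)


def collapse_letter_spacing(text: str) -> str:
    """Rejoin runs of three or more single letters separated by single spaces."""
    pattern = r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b"
    return re.sub(pattern, lambda m: m.group(0).replace(" ", ""), text)


plain = "this word is forbidden here"
spaced = f"this word is {letter_space('forbidden')} here"

print(toy_detector(plain))                            # token match succeeds
print(toy_detector(spaced))                           # each letter is its own token
print(toy_detector(collapse_letter_spacing(spaced)))  # naive normalization restores the token
```

A reader still sees the spaced word at a glance, while the token-level check sees only isolated letters. The normalization step is a deliberately naive countermeasure for this one manipulation; it would not handle visual emphasis or spatial arrangement, which the paper also covers.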

Results
The authors evaluated HPAA on multiple datasets and ten deployed moderation systems, including commercial APIs and state-of-the-art open-source models. The results show a stark gap between human and machine perception: with only three detector queries, the generated attacks achieved over 86% human recognition while detection rates stayed below 1% across the evaluated systems. In other words, the attack evaded the moderation systems tested almost every time while remaining legible to people.

Limitations
The authors acknowledge that their approach relies on specific typographic manipulations, which may not generalize across all content types or languages. Additionally, the study does not explore the long-term implications of these attacks on moderation systems or the potential for adversarial training to mitigate such vulnerabilities. The paper also does not address the ethical implications of deploying such attacks in real-world scenarios.

Why it matters
The findings expose a fundamental blind spot in current LLM-based moderation and point to the need for systems that incorporate human-like perceptual reasoning. For the design of future content moderation frameworks, this means accounting for visual cues in order to be robust against adversarial attacks of this kind. The work is a call for researchers and practitioners to build moderation systems that align more closely with human perception. The paper is available on arXiv.

By Callan Zhang · Jun 8, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG