Major alignment safety

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

Yitong Zhang, Shiteng Lu, Jia Li

Published
Jun 10, 2026 — 08:50 UTC

Problem
This work addresses a critical gap in the security of Large Language Models (LLMs) used for code generation, particularly the unintended consequences of employing Grammar-Constrained Decoding (GCD). While GCD is designed to enhance the syntactic validity of generated code, the authors reveal that it can be exploited to induce LLMs to produce malicious code. This research is particularly relevant as it highlights a novel attack vector, termed CodeSpear, which leverages GCD to bypass safety mechanisms. The paper is a preprint and has not undergone peer review, indicating that the findings should be interpreted with caution.

Method
The authors introduce CodeSpear, a jailbreak attack that manipulates GCD to generate harmful code. The attack exploits the constraints of benign code grammar to trick LLMs into producing malicious outputs. To counter this vulnerability, the authors propose CodeShield, a safety alignment mechanism that teaches LLMs to generate “honeypot” code—code that appears valid but is semantically harmless. CodeShield operates by aligning the model’s output with safety protocols while maintaining the ability to refuse malicious requests when prompted in natural language. The training process for CodeShield involves using a diverse set of benign code structures to ensure robustness against grammar tightening, thus complicating suppression efforts.

Results
The experimental results demonstrate that CodeSpear significantly outperforms existing jailbreak methods, achieving an average attack success rate increase of over 30 percentage points across 10 popular LLMs evaluated on 4 distinct benchmarks. The authors provide specific metrics indicating that the attack’s efficacy is markedly higher than that of representative baselines, underscoring the effectiveness of the proposed jailbreak technique. Furthermore, CodeShield successfully mitigates the risks posed by CodeSpear, restoring safety while preserving the utility of the LLMs.

Limitations
The authors acknowledge that while CodeShield effectively addresses the vulnerabilities identified, it may not be foolproof against all potential attack vectors. They do not explore the scalability of CodeShield across all possible LLM architectures or the implications of adversarial training on model performance. Additionally, the paper does not address the potential for attackers to develop more sophisticated grammar constraints that could circumvent CodeShield’s defenses.

Why it matters
This research has significant implications for the deployment of LLMs in sensitive applications, particularly in code generation contexts. The findings underscore the need for a reevaluation of GCD as a safety mechanism, highlighting its potential to serve as an attack surface rather than a protective measure. The introduction of CodeShield offers a promising avenue for enhancing the security of LLMs, but it also calls for ongoing vigilance and innovation in safety alignment strategies. This work contributes to the broader discourse on AI safety and security, as discussed in related literature, and is available on arXiv.

Turing Wire

By Turing Wire editorial staff · Jun 10, 2026 · Editorial standards →

Source: arXiv cs.CL