CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs
Joachim Schaeffer, Thomas Jiralerspong, Alexander Panfilov, Guillaume Lajoie, Jonas Geiping, Yoshua Bengio
- Published
- Jun 9, 2026 — 16:24 UTC
Problem
The paper addresses the gap in understanding how frontier large language models (LLMs) respond to control interventions, particularly their ability to recognize when their outputs are being modified by external control protocols. This is crucial for ensuring the safety and reliability of AI systems, especially in untrusted environments. The authors note that existing literature lacks a systematic approach to measure this awareness, and the work is presented as a preprint, indicating it has not yet undergone peer review.
Method
The authors propose CIAware-Bench, a benchmark designed to evaluate control intervention (CI) awareness across eleven frontier LLMs. The benchmark consists of four distinct task domains: essay writing, BigCodeBench, Bash Arena, and SHADE-Arena. The evaluation methodology involves varying parameters such as trajectory watermarking, the presence of side-tasks, and the specifics of the control protocol employed. The models are assessed on their ability to distinguish between their original outputs and those modified by control interventions, with performance measured using balanced binary classification accuracy. The authors report that the models exhibit low to moderate CI awareness, with a maximum accuracy of 0.87, where random chance is 0.5.
Results
The evaluation reveals substantial variability in CI awareness across different task domains and model pairs. Notably, detection of control interventions is generally more successful within the same model family, suggesting that models leverage provider-specific stylistic differences or post-training characteristics. The results indicate that CI awareness is not a static property of a model but varies significantly with each new model release and deployment context. The findings underscore the necessity for continuous assessment of CI awareness as models evolve.
Limitations
The authors acknowledge that their benchmark may not cover all possible control intervention scenarios and that the results are contingent on the specific configurations tested. They also note that the low to moderate CI awareness observed could be influenced by the inherent characteristics of the models evaluated. Additionally, the study does not explore the implications of CI awareness on the broader safety and ethical considerations of deploying LLMs in real-world applications.
Why it matters
The introduction of CIAware-Bench provides a critical tool for researchers and practitioners to assess and enhance the safety of AI systems by measuring their susceptibility to control interventions. This work has significant implications for the development of robust control protocols that can effectively manage untrusted models, thereby improving the reliability of AI applications in sensitive domains. The findings highlight the need for ongoing evaluation of model behavior in response to control interventions, as emphasized in the paper. This research is available on arXiv and sets the stage for future investigations into the dynamics of model awareness and control mechanisms.
By Turing Wire editorial staff · Jun 9, 2026 · Editorial standards →
Source: arXiv cs.AI