Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race
Andreas Maier, Moritz Zaiss, Siming Bayer
- Published: May 4, 2026 — 14:10 UTC
- Summary length: 450 words
- Relevance score: 80%
Problem
This preprint addresses the lack of empirical reproducibility in NLP work on personal-style post-editing of large language model (LLM) drafts. The authors set out to replicate a recent ACL 2026 study on the correlation between perceived self-similarity and embedding-measured self-similarity in text, a study that originally required extensive human effort. By leveraging a modern agentic-research framework, they seek to demonstrate that rapid, largely automated experimentation can yield comparable or superior results to traditional methods.
Method
The authors systematically reproduce all seven preregistered hypotheses from the original study, with a human investigator acting solely as a reviewer-in-the-loop. Experiments use GPT-5.5 and Claude Opus 4.7 under a leakage-free held-out protocol spanning 324 paired tasks. The core technical contribution is a leave-authors-out linear SVM trained on LUAR-MUD embeddings to assess how detectable the AI text is. In an adversarial extension, the Opus agent receives 20 feedback iterations against a frozen detector, allowing it to adapt its outputs to evade detection. The study also introduces three new experiments that extend the original findings.
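To make the detector protocol concrete, the following is a minimal sketch of a leave-authors-out linear SVM over precomputed LUAR-MUD embeddings; the array shapes, labels, and grouping variable are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch: leave-authors-out linear SVM detector over precomputed
# LUAR-MUD embeddings. Shapes, labels, and the author grouping are
# illustrative assumptions, not the paper's exact pipeline.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

def leave_authors_out_auc(X, y, authors):
    """X: (n_texts, dim) embedding matrix; y: 1 = LLM-edited, 0 = human;
    authors: author id per text, held out as a group to avoid leakage."""
    splitter = LeaveOneGroupOut()
    aucs = []
    for train_idx, test_idx in splitter.split(X, y, groups=authors):
        clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.decision_function(X[test_idx])  # margin as detection score
        if len(np.unique(y[test_idx])) == 2:         # AUC needs both classes
            aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```

Grouping the split by author keeps each author's texts entirely in either the train or the test fold, which is what makes such a protocol leakage-free with respect to authorial style.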
Results
The authors successfully replicate the original study’s correlation between perceived self-similarity and embedding-measured self-similarity, reporting r = +0.244 with p < 10^-8 on a sample of 648 items. In terms of performance, GPT-5.5 and Claude Opus 4.7 close 71% to 75% of the style gap to the same-author ceiling, significantly outperforming human post-editing, which closes only 24%; the LLMs surpass human performance on approximately 80% of the tasks. For AI detection, the linear SVM achieves an AUC of 0.93 to 1.00 across approaches, indicating high efficacy in distinguishing AI-generated from human-generated text.
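For intuition, the replicated correlation could be computed along the following lines; the rating scale, the cosine-similarity measure, and the variable names are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: correlating perceived self-similarity ratings with an
# embedding-based self-similarity measure (illustrative assumptions only).
import numpy as np
from scipy.stats import pearsonr

def embedding_self_similarity(text_emb, author_ref_emb):
    """Cosine similarity between one edited text's embedding and the mean
    embedding of the same author's reference writing (assumed setup)."""
    a = text_emb / np.linalg.norm(text_emb)
    b = author_ref_emb / np.linalg.norm(author_ref_emb)
    return float(a @ b)

def correlate(perceived, measured):
    """perceived: human self-similarity ratings; measured: values from
    embedding_self_similarity(), one pair per item (n = 648 in the paper)."""
    r, p = pearsonr(perceived, measured)
    return r, p
```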
Limitations
The authors acknowledge that their findings are contingent on the specific models and datasets used and may not generalize to other contexts or LLMs. They also note that GPT-5.5's detection judgments are driven primarily by text length, a potential confound in the detection results. Additionally, the study does not explore the ethical implications of using LLMs to evade detection, nor the long-term consequences of such an arms race in AI-generated content.
Why it matters
This work has significant implications for the ongoing discourse around AI-generated text and its detection. By demonstrating that advanced LLMs can effectively reduce their detectability with minimal human intervention, the study raises critical questions about the reliability of current detection methods and the potential for misuse in generating deceptive content. The findings could inform future research on improving detection algorithms and understanding the stylistic signatures of AI-generated text, ultimately shaping the development of more robust frameworks for distinguishing between human and machine-generated content.
Authors: Andreas Maier, Moritz Zaiss, Siming Bayer
Source: arXiv:2605.02620
URL: https://arxiv.org/abs/2605.02620v1