AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development
Yuecai Zhu, Nikolaos Tsantalis, Peter C. Rigby
- Published: May 4, 2026, 15:41 UTC
- Summary length: 458 words
- Relevance score: 80%
Problem
This preprint addresses a gap in the literature on the long-term maintainability of software generated by Large Language Models (LLMs) and AI agents. While prior research has focused predominantly on functional correctness, this work examines the technical debt that AI-generated code introduces. The authors systematically analyze the architectural flaws and defects that arise in AI-driven development, revealing a distinct pattern of structural degradation that challenges prevailing assumptions about LLM capabilities in software engineering.
Method
The authors apply a multi-scale analysis framework to AI-generated software across two domains: single-file algorithmic tasks and agent-generated complex systems. They introduce the Reasoning-Complexity Trade-off, which posits that as LLMs grow more capable, the code they generate becomes more bloated and more tightly coupled. The study also establishes a Volume-Quality Inverse Law: code volume is a strong predictor of structural degradation. The analysis draws on a range of architectural-quality metrics, although the loss functions, training compute, and datasets behind the evaluated LLMs are not disclosed in the paper.
Results
The findings show that AI-generated code carries substantially elevated technical debt, with a clear correlation between code volume and architectural flaws. Neither functional correctness nor detailed prompting strategies mitigate the observed decay in code quality. The results suggest that the architectural integrity of AI-generated software diminishes as model capability increases, challenging the effectiveness of current prompt-driven generation techniques. The paper offers quantitative evidence for the Volume-Quality Inverse Law, although specific numbers and baseline comparisons are not detailed in the abstract.
Limitations
The authors acknowledge that their analysis is limited to the specific tasks and models evaluated, which may not generalize to all AI-generated software. They do not address how different programming languages or frameworks might affect the observed architectural decay. The study also leaves unexplored the influence of differing amounts of model training data and of alternative prompting techniques on the results. Finally, the lack of detailed metrics and comparisons to established baselines may limit the reproducibility of the findings.
Why it matters
This work has significant implications for the future of AI-assisted software engineering. By reframing the challenge from mere code generation to architectural complexity management, it emphasizes the need for LLMs and AI agents to incorporate architectural foresight in their design processes. This shift could lead to the development of more maintainable and robust software systems, ultimately enhancing the long-term viability of AI-generated code. The findings encourage researchers and practitioners to prioritize architectural considerations in the design of AI systems, paving the way for more sustainable software engineering practices.
Source: arXiv:2605.02741
URL: https://arxiv.org/abs/2605.02741v1