AI Code Iterations Introduce More Security Flaws

Defenses
Published: Tue, May 20, 2025 • By Dr. Marcus Halden
New research finds that iterative AI code improvements often add security flaws rather than fix them. In controlled tests, vulnerabilities rise sharply after a few automated iterations, and different prompt styles create distinct risk patterns. The study urges mandatory human checks, routine use of analysis tools, and limits on consecutive AI-only edits to prevent unsafe regressions.

In plain terms: letting AI repeatedly rewrite code can make things worse. A controlled study of 400 samples across 40 rounds of automated improvements finds a 37.6% jump in critical vulnerabilities after only five iterations. Even prompts that ask the model to improve security can introduce new, subtle bugs.

The researchers ran an experiment with four common prompting styles and tracked vulnerabilities in C and Java code using static analysis tools plus manual review. Efficiency-focused prompts trade memory safety for speed, feature-focused prompts add functionality that creates race conditions, and security-focused prompts sometimes produce flawed crypto or fall back on outdated patterns. The upshot is clear: more AI edits do not equal safer code.

Why this matters for teams: AI assistants can accelerate development, but they can also silently erode safety if you let automated rounds run unchecked. Complexity tends to grow with each iteration, and higher complexity correlates strongly with more security issues. Think of the model like a well-meaning intern who keeps rearranging cables until a server trips.

Operational takeaways

  • Limit consecutive AI-only revisions to three to reduce risk.
  • Require human review between iterations, especially for crypto and memory code.
  • Run static and dynamic analysis after every iteration and watch for complexity spikes (a minimal gating sketch follows this list).
  • Prioritize human sign-off before merging AI-generated changes into critical systems.
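
These limits are straightforward to encode as a pre-merge gate. The sketch below is a minimal illustration only, not tooling from the study: it assumes a hypothetical Revision record that tags each edit as AI- or human-authored and carries a cyclomatic complexity figure, and it blocks further AI-only edits once the recommended three-iteration limit is hit or complexity jumps sharply.

```python
from dataclasses import dataclass

# Threshold taken from the study's recommendation: no more than three
# consecutive AI-only revisions before a human reviews the code.
MAX_CONSECUTIVE_AI_EDITS = 3

@dataclass
class Revision:
    """One revision in the history (hypothetical shape, for illustration)."""
    author: str                # "ai" or "human"
    reviewed_by_human: bool
    cyclomatic_complexity: int

def needs_human_gate(history: list[Revision],
                     complexity_spike_ratio: float = 1.25) -> list[str]:
    """Return reasons why the next AI-only edit should be blocked."""
    reasons = []

    # Count trailing AI-authored, unreviewed revisions.
    consecutive_ai = 0
    for rev in reversed(history):
        if rev.author == "ai" and not rev.reviewed_by_human:
            consecutive_ai += 1
        else:
            break
    if consecutive_ai >= MAX_CONSECUTIVE_AI_EDITS:
        reasons.append(f"{consecutive_ai} consecutive AI-only edits; human review required")

    # Flag rapid complexity growth, which the study links to new vulnerabilities.
    if len(history) >= 2:
        prev, curr = history[-2], history[-1]
        if prev.cyclomatic_complexity and (
                curr.cyclomatic_complexity / prev.cyclomatic_complexity > complexity_spike_ratio):
            reasons.append("complexity spike since last revision; run a full security review")

    return reasons

if __name__ == "__main__":
    history = [
        Revision("human", True, 12),
        Revision("ai", False, 14),
        Revision("ai", False, 19),
        Revision("ai", False, 26),
    ]
    for reason in needs_human_gate(history):
        print("BLOCKED:", reason)
```

In practice the same check could run in CI, with the complexity figure coming from whatever metric tool the team already uses.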

The lesson is practical: use AI to assist, not to autonomously drive code evolution. With simple guardrails and routine checks, teams can keep the speed gains while avoiding paradoxical security degradation.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Security Degradation in Iterative AI Code Generation -- A Systematic Analysis of the Paradox

Authors: Shivani Shukla, Himanshu Joshi, and Romilla Syed
The rapid adoption of Large Language Models (LLMs) for code generation has transformed software development, yet little attention has been given to how security vulnerabilities evolve through iterative LLM feedback. This paper analyzes security degradation in AI-generated code through a controlled experiment with 400 code samples across 40 rounds of "improvements" using four distinct prompting strategies. Our findings show a 37.6% increase in critical vulnerabilities after just five iterations, with distinct vulnerability patterns emerging across different prompting approaches. This evidence challenges the assumption that iterative LLM refinement improves code security and highlights the essential role of human expertise in the loop. We propose practical guidelines for developers to mitigate these risks, emphasizing the need for robust human validation between LLM iterations to prevent the paradoxical introduction of new security issues during supposedly beneficial code "improvements".

🔍 ShortSpan Analysis of the Paper

Problem

This paper investigates how security vulnerabilities evolve when code generated by large language models is iteratively improved through feedback, a phenomenon termed feedback loop security degradation. In a controlled experiment, the authors analyse 400 code samples across 40 rounds of improvements using four prompting strategies. The study reports a 37.6 per cent increase in critical vulnerabilities after five iterations, with distinct vulnerability patterns emerging for different prompting approaches. The findings challenge the assumption that iterative refinement enhances code security and emphasise the essential role of human expertise in the development loop. The authors propose practical guidelines to mitigate these risks, stressing robust human validation between iterations to prevent the paradoxical introduction of new security issues during so-called improvements.

Approach

The study uses a controlled experiment with 400 generated samples across 40 rounds of automated improvements under four prompting strategies: Efficiency-focused, Feature-focused, Security-focused, and Ambiguous improvement. Ten secure baseline samples in C and Java were selected and verified as vulnerability free by multiple static analysis tools and expert review. The primary language model was GPT-4o, used with temperature 0.7 and top_p 1.0 to ensure reproducibility. After each iteration, code was analysed with static analysis tools including Clang Static Analyzer, CodeQL and SpotBugs, supplemented by manual security review. Vulnerabilities were categorised into twelve classes and assigned severities using a CVSS-based framework. For each iteration the authors collected the generated code, vulnerabilities detected, changes in code complexity metrics (cyclomatic complexity and lines of code) and functional correctness. The experiment produced 400 code samples (10 baselines × 4 prompting strategies × 10 iterations per sample). Representative prompts for each strategy and a structured vulnerability analysis framework underpinned the study.
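
To make the protocol concrete, the following Python pseudocode is a reconstruction of the loop as described, not the authors' implementation; llm_improve, run_static_analysis and measure_complexity are stand-ins for the GPT-4o calls, analyser runs and metric collection the paper reports.

```python
import itertools

# Illustrative reconstruction of the experiment's loop structure (not the
# authors' code). Model calls and analyser invocations are stubbed out.
PROMPTING_STRATEGIES = ["efficiency", "feature", "security", "ambiguous"]
ITERATIONS_PER_SAMPLE = 10

def llm_improve(code: str, strategy: str) -> str:
    """Placeholder for a GPT-4o call (temperature 0.7, top_p 1.0 in the paper)."""
    raise NotImplementedError

def run_static_analysis(code: str) -> list:
    """Placeholder for Clang Static Analyzer / CodeQL / SpotBugs plus manual review."""
    raise NotImplementedError

def measure_complexity(code: str) -> dict:
    """Placeholder returning cyclomatic complexity and lines of code."""
    raise NotImplementedError

def run_experiment(baselines: dict) -> list:
    """baselines maps a sample id to verified vulnerability-free C or Java source."""
    records = []
    for (sample_id, code), strategy in itertools.product(
            baselines.items(), PROMPTING_STRATEGIES):
        current = code
        for iteration in range(1, ITERATIONS_PER_SAMPLE + 1):
            # Each round feeds the previous output back into the model.
            current = llm_improve(current, strategy)
            records.append({
                "sample": sample_id,
                "strategy": strategy,
                "iteration": iteration,
                "vulnerabilities": run_static_analysis(current),
                "complexity": measure_complexity(current),
            })
    return records  # 10 baselines x 4 strategies x 10 iterations = 400 records
```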

Key Findings

  • Across all prompting strategies, the study identified 387 distinct security vulnerabilities, with later iterations showing higher vulnerability counts than early ones.
  • The feature-focused prompting strategy yielded the largest total vulnerability count (158 occurrences), while security-focused prompting yielded the smallest (38); yet even prompts that explicitly sought security improvements introduced new vulnerabilities.
  • Early iterations averaged 2.1 vulnerabilities per sample (SD 0.9); middle iterations averaged 4.7 (SD 1.2); late iterations averaged 6.2 (SD 1.8). Repeated measures analysis indicated significant differences between early and late iterations with a medium to large effect size.
  • Vulnerability patterns differed by prompting strategy: efficiency-focused prompts were most associated with memory safety issues (42.7%), feature-focused prompts with concurrency problems (30.4%), and security-focused prompts, despite fewer overall vulnerabilities, had a notable share of cryptographic implementation errors (21.1%).
  • Qualitative analysis revealed three recurrent cryptographic patterns: library misuse, overengineering leading to flawed integrations, and reliance on outdated security patterns. This suggests limitations in LLMs' understanding of secure coding contexts and practical implementations of security principles.
  • The study found a positive association between code complexity and vulnerabilities, with a reported correlation of r = 0.64. A 10 per cent increase in complexity was linked to a 14.3 per cent increase in vulnerability count, even after accounting for prompting strategy and baseline code characteristics (a rough compounding illustration follows this list).
  • Case studies illustrated how memory management, authentication logic, and database access layers evolved to introduce new vulnerabilities through iterative improvements, including memory reuse, timing side channels, SQL injection risks, and race conditions. Some early improvements did lead to net security gains, but these were often offset by later, subtler issues.
  • Mitigation insights emerged: human review between iterations is crucial; limiting consecutive LLM-only iterations to three can reduce risk; security reviews after each iteration and the use of static analysis tools between iterations are recommended; and monitoring for rapid complexity growth is advised as a predictor of vulnerability introduction.
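
To put the complexity finding in operational terms, the short calculation below is our back-of-the-envelope arithmetic, not the paper's; it assumes the reported 10 per cent to 14.3 per cent association holds roughly constant across the range.

```python
import math

# Back-of-the-envelope reading of the reported association (our arithmetic,
# assuming the 10% -> 14.3% rate stays roughly constant across the range).
COMPLEXITY_STEP = 1.10   # a 10 per cent increase in complexity
VULN_STEP = 1.143        # the associated 14.3 per cent increase in vulnerabilities

# Implied exponent if vulnerabilities scale roughly as complexity ** k.
k = math.log(VULN_STEP) / math.log(COMPLEXITY_STEP)
print(f"implied exponent k = {k:.2f}")   # about 1.40: vulnerabilities outpace complexity

# Compounding over a few unchecked iterations that each add ~10% complexity.
for n in (1, 3, 5):
    growth = VULN_STEP ** n - 1
    print(f"after {n} step(s): roughly {growth:.0%} more vulnerabilities")
```

Read loosely, vulnerabilities appear to grow faster than complexity does, which is consistent with the authors' advice to treat rapid complexity growth as an early warning of vulnerability introduction.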

Limitations

The study focuses on a single large language model, GPT-4o, and on two programming languages, C and Java, which may limit generalisability to other models and languages. The experiment simulated fully automated feedback loops without human input between iterations, representing a worst case for automated evolution rather than typical real-world practice, where human review occurs. The authors acknowledge that real-world development usually includes human involvement that could mitigate the observed degradation.

Why It Matters

The results have practical implications for software development, tooling and governance in AI-aided programming. They highlight the risk that iterative AI-assisted code generation can propagate or amplify security flaws, potentially affecting software supply chains and critical systems. The work supports a shift toward human-in-the-loop practices, emphasising security review between iterations, routine static analysis, and careful monitoring of code complexity. It also points to the need for security-aware features in AI coding assistants and for automated critics that detect potentially problematic transformations. Overall the research advocates a hybrid approach in which AI assists rather than autonomously drives code evolution, underpinned by rigorous human validation and conventional security testing.
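
As a rough sense of what such an automated critic might look like, here is a deliberately small, pattern-based sketch. It is our illustration, not a mechanism proposed in the paper, and the pattern list is indicative only; a real critic would build on the same static analysers the study used.

```python
import re

# A toy "critic" in the spirit of that suggestion: flag transformations that
# introduce patterns tied to the vulnerability classes the study observed
# (memory safety, weak or misused crypto). The pattern list is ours and purely
# indicative; a real critic would sit on top of CodeQL-style analysis.
RISKY_PATTERNS = {
    r"\bstrcpy\s*\(": "unbounded copy (memory safety)",
    r"\bgets\s*\(": "unbounded read (memory safety)",
    r"\bMD5\b|\bSHA-?1\b": "weak hash (outdated crypto pattern)",
    r"AES/ECB": "ECB mode (crypto misuse)",
}

def critique_transformation(old_code: str, new_code: str) -> list:
    """Report risky patterns present in the new revision but absent from the old one."""
    findings = []
    for pattern, description in RISKY_PATTERNS.items():
        if re.search(pattern, new_code) and not re.search(pattern, old_code):
            findings.append(f"introduced: {description}")
    return findings

if __name__ == "__main__":
    before = 'digest = MessageDigest.getInstance("SHA-256").digest(data);'
    after = 'Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");'
    print(critique_transformation(before, after))
    # ['introduced: ECB mode (crypto misuse)']
```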

