
Study Finds Widespread Vulnerabilities in AI C/C++ Code

Pentesting
Published: Tue, Nov 25, 2025 • By Dr. Marcus Halden
Researchers test ten Large Language Models (LLMs) that generate C and C++ code and find many outputs contain common, real-world vulnerabilities. Static scanners report dozens of Common Weakness Enumeration (CWE) instances, some mapping to recorded Common Vulnerabilities and Exposures (CVEs). The study urges treating AI-produced code as untrusted and adding security checks.

This paper examines the security of code produced by Large Language Models (LLMs) for C and C++. The authors build a reproducible workflow that starts from prompts inspired by the Common Weakness Enumeration (CWE) catalogue, generates code with ten LLMs, and scans the results with static analysis tools. They map identified CWEs to Common Vulnerabilities and Exposures (CVEs) in the National Vulnerability Database to estimate real-world impact. The work is methodical, and the results are straightforward enough to make practitioners uncomfortable.

What the study did

The researchers crafted 84 prompts targeting CWEs and produced two codebases per model, one from a general Code Generator assistant and one from a Secure Code Generator assistant. That yielded 20 codebases across ten models, generated either locally for the open-source models or via cloud APIs for the closed-source ones. Static analysis came from three tools: CodeQL, Snyk Code and CodeShield. The team used a custom SARIF parser to aggregate findings and to map weaknesses to real CVEs, making the dataset practical for defensive work.

What it shows and what to do

The headline finding is blunt: LLM-generated C and C++ often contain well-known security problems. CodeQL alone reported 38 unique CWEs across the outputs; it found 37 CWEs in general-generator outputs and 29 in the secure-generator variants. Snyk Code and CodeShield flagged complementary issues, with Snyk identifying 10 CWEs in general outputs and 12 in secure ones, while CodeShield reported fewer items. Recurring patterns include unchecked return values and buffer-related weaknesses such as heap buffer overflow. Several of the identified CWEs correspond to large numbers of mapped CVEs, indicating high operational risk rather than theoretical flaws.
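To make those recurring categories concrete, the snippet below is a minimal, hypothetical C example (not taken from the study's generated outputs) of the two patterns just mentioned: an allocation whose return value is never checked, and an unbounded copy into a heap buffer of the kind CWE-122 and CWE-787 describe.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical example, not from the paper's outputs: the kind of code
     * that static scanners flag in LLM-generated C. */
    void store_username(const char *input) {
        char *buf = malloc(16);   /* CWE-252: return value of malloc never checked */
        strcpy(buf, input);       /* CWE-122/CWE-787: unbounded copy overflows the
                                     16-byte heap buffer for longer inputs */
        printf("stored: %s\n", buf);
        free(buf);
    }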

Coverage differs a lot between scanners, so no single tool gives full visibility. The study also notes variability across models: one model produced many flagged weaknesses while another appeared cleaner but produced fewer complete outputs. Some prompts elicited refusals or secure explanations instead of code, and non-compilable files also complicated comparisons. The authors are clear about limits: LLM outputs vary with prompt wording and random seeds, scanner rules differ, and the mapping to CVEs depends on the NVD.

For defenders the implications are direct. Treat AI-produced code as untrusted input rather than trusted automation. Integrate multiple static scanners into CI, require human review for system-level code, and enforce provenance so generated snippets are tracked and audited. Relying on a single model or a single scanner is risky.

  • Scan AI-generated code with multiple tools, and fail CI on high-risk CWEs.
  • Mandate human review for code in memory-unsafe languages, especially low-level C and C++ changes.
  • Record provenance of generated snippets and apply post-generation remediation; a hardened rewrite of the snippet above is sketched below.
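As a hedged sketch of what post-generation remediation can look like in practice, here is one way the hypothetical snippet above could be hardened; the function name and the 16-byte capacity are illustrative choices, not details from the paper.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hardened variant of the earlier hypothetical snippet: the allocation is
     * checked and the copy is bounded, addressing CWE-252 and CWE-122/CWE-787. */
    int store_username(const char *input) {
        const size_t cap = 16;
        char *buf = malloc(cap);
        if (buf == NULL) {
            return -1;                    /* propagate allocation failure */
        }
        snprintf(buf, cap, "%s", input);  /* bounded copy, always NUL-terminated */
        printf("stored: %s\n", buf);
        free(buf);
        return 0;
    }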

The study does not break new theoretical ground, but it delivers useful, evidence-based warnings. If your organisation uses LLMs to accelerate development, expect classic C and C++ bugs and plan for detection and mitigation as part of the development pipeline.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

LLM-CSEC: Empirical Evaluation of Security in C/C++ Code Generated by Large Language Models

Authors: Muhammad Usman Shahid, Chuadhry Mujeeb Ahmed, and Rajiv Ranjan
The security of code generated by large language models (LLMs) is a significant concern, as studies indicate that such code often contains vulnerabilities and lacks essential defensive programming constructs. This work focuses on examining and evaluating the security of LLM-generated code, particularly in the context of C/C++. We categorized known vulnerabilities using the Common Weakness Enumeration (CWE) and, to study their criticality, mapped them to CVEs. We used ten different LLMs for code generation and analyzed the outputs through static analysis. The amount of CWEs present in AI-generated code is concerning. Our findings highlight the need for developers to be cautious when using LLM-generated code. This study provides valuable insights to advance automated code generation and encourage further research in this domain.

🔍 ShortSpan Analysis of the Paper

Problem

The paper investigates the security of code generated by large language models in the context of C and C++ and notes that such code often contains vulnerabilities and lacks defensive programming constructs. Vulnerabilities are categorised using the MITRE Common Weakness Enumeration (CWE) framework and mapped to real-world CVEs to assess their criticality. Ten different LLMs are used to generate code and the outputs are analysed with static analysis tools to quantify security risks. The work emphasises the need for caution when using AI-generated code and aims to provide insights to improve automated code generation and security research.

Approach

The study adopts a three-phase workflow consisting of Prompt Engineering, Code Generation and Code Security Analysis. Prompts are crafted from CWE baselines using MITRE CWE data and mapped to CVEs via the National Vulnerability Database. A novel dataset of 84 prompts targets CWEs for C and C++, and two codebases are produced for each model using two assistants, a Code Generator (CG) and a Secure Code Generator (SCG), yielding 20 codebases in total. Ten LLMs are evaluated, including both open-source and closed-source models, among them four dedicated code-generation models and two lightweight models, paired with code-specialised variants where applicable. Open-source models are run locally via Ollama and closed-source models via APIs. The static analysis tools CodeQL, Snyk Code and CodeShield are used to identify vulnerabilities, with SARIF reports produced and processed by a custom tool, SarifMiner, to map CWEs to CVEs and to generate visualisations. The methodology is designed to be general and reproducible across languages and models.
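SarifMiner itself is the authors' custom tool and is not reproduced here. As a rough illustration of the aggregation step such a tool performs, the sketch below (an assumption-laden example using the open-source cJSON library, which is not named in the paper) walks a SARIF report and prints each result's rule identifier and message; a real pipeline would additionally resolve rule metadata to CWEs and query the NVD for CVE mappings, which this sketch omits.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cjson/cJSON.h>   /* build with -lcjson */

    /* Illustrative only, not the paper's SarifMiner: print ruleId and message
     * for every result in a SARIF report. */
    static char *read_file(const char *path) {
        FILE *f = fopen(path, "rb");
        if (f == NULL) return NULL;
        if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
        long len = ftell(f);
        if (len < 0) { fclose(f); return NULL; }
        rewind(f);
        char *buf = malloc((size_t)len + 1);
        if (buf == NULL || fread(buf, 1, (size_t)len, f) != (size_t)len) {
            fclose(f);
            free(buf);
            return NULL;
        }
        buf[len] = '\0';
        fclose(f);
        return buf;
    }

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s report.sarif\n", argv[0]);
            return 1;
        }
        char *text = read_file(argv[1]);
        if (text == NULL) {
            fprintf(stderr, "could not read %s\n", argv[1]);
            return 1;
        }
        cJSON *root = cJSON_Parse(text);
        if (root == NULL) {
            fprintf(stderr, "invalid JSON\n");
            free(text);
            return 1;
        }
        /* SARIF layout: runs[] -> results[] -> { ruleId, message.text } */
        const cJSON *run = NULL;
        cJSON_ArrayForEach(run, cJSON_GetObjectItemCaseSensitive(root, "runs")) {
            const cJSON *res = NULL;
            cJSON_ArrayForEach(res, cJSON_GetObjectItemCaseSensitive(run, "results")) {
                const cJSON *rule = cJSON_GetObjectItemCaseSensitive(res, "ruleId");
                const cJSON *msg  = cJSON_GetObjectItemCaseSensitive(res, "message");
                const cJSON *txt  = msg ? cJSON_GetObjectItemCaseSensitive(msg, "text") : NULL;
                printf("%s\t%s\n",
                       cJSON_IsString(rule) ? rule->valuestring : "(no ruleId)",
                       cJSON_IsString(txt)  ? txt->valuestring  : "");
            }
        }
        cJSON_Delete(root);
        free(text);
        return 0;
    }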

Key Findings

  • Across ten models the static analysis of CG and SCG codebases identified substantial vulnerability presence, with CodeQL detecting 37 unique CWEs in CG and 29 in SCG, Snyk Code 10 in CG and 12 in SCG, and CodeShield 3 in CG and 2 in SCG.
  • Overall CodeQL identified 38 unique CWEs across 20 codebases, encompassing 1,625 generated files, while Snyk Code and CodeShield revealed complementary patterns due to differing rule sets and coverage.
  • Mistral 7B produced the highest vulnerability count by CodeQL, with 18 CWEs in CG and 16 in SCG, whereas Llama 2 demonstrated stronger apparent security with 11 CWEs in CG and 2 in SCG, albeit with a lower code-generation completion rate (74 in SCG versus 99 in CG).
  • Recurring vulnerabilities include unchecked return values (CWE-252, CWE-253) and buffer-related weaknesses (CWE-120, CWE-787, CWE-805); CWE-122, heap-based buffer overflow, was notably frequent across models, and several CWEs such as CWE-119 and CWE-120 were associated with a large number of mapped CVEs, indicating high real-world risk (a short illustration of the unchecked-return pattern follows this list).
  • Comparisons across tools show broad coverage differences: CodeQL identified more CWEs than Snyk Code and CodeShield, and there was limited overlap among tools, underscoring the benefit of using multiple scanners for a comprehensive assessment.
  • Instances where no code was generated were observed for five models (Gemini 1.5 Pro, Llama 2 7B, CodeLlama 7B, Granite Code 3B and Codestral 22B), with several of these no-code outputs occurring in SCG prompts for Llama 2 7B and others; some prompts elicited secure explanations rather than code.
  • The study also reports that secure code generation is inconsistent across prompts and models, with some secure prompts still producing insecure outputs and others declining to generate code due to security concerns, illustrating the challenge of reliably safe AI-assisted programming.
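To make the unchecked-return finding concrete, the following hypothetical snippet (not drawn from the study's outputs) contrasts a function that ignores what fgets and sscanf report, the essence of CWE-252 and CWE-253, with a variant that surfaces the failure to its caller.

    #include <stdio.h>

    /* Hypothetical illustration of CWE-252/CWE-253 (unchecked or incorrectly
     * checked return value); not taken from the paper's outputs. */
    int read_port_flawed(FILE *cfg, char *line, size_t cap) {
        fgets(line, (int)cap, cfg);     /* NULL on EOF/error is ignored, so `line`
                                           may hold stale or uninitialised data */
        int port = 0;
        sscanf(line, "port=%d", &port); /* conversion count ignored: a malformed
                                           line silently yields port == 0 */
        return port;
    }

    /* Checked variant: every return value is inspected and failure is reported. */
    int read_port_checked(FILE *cfg, char *line, size_t cap, int *port) {
        if (fgets(line, (int)cap, cfg) == NULL) return -1;
        if (sscanf(line, "port=%d", port) != 1) return -1;
        return 0;
    }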

Limitations

The authors identify internal validity threats, including the probabilistic nature of LLM outputs, which can vary across runs, the influence of prompt design on results, and hardware constraints for open-source models. External validity is limited by the specific tools used for vulnerability detection (CodeQL, Snyk Code and CodeShield) and the CWE-to-CVE mappings via the NVD; results depend on the coverage and accuracy of these tools and datasets. Non-compilable files and prompts that produced no code also affect comparative conclusions across models.

Why It Matters

The work highlights that LLM-generated code in C and C++ frequently contains well-known and high-risk vulnerabilities, with many CWEs mapping to documented CVEs. The findings support the need for integrated security checks in AI-assisted code generation, including automated security scans, secure-by-default prompting, and provenance tracking of generated code. Practically, these insights argue for safer generation pipelines and post-generation remediation; at a societal level, they underline the potential impact on critical systems and software supply chains as AI-assisted coding becomes more pervasive. The study provides a structured framework and datasets to enable reproducible security evaluation across models and programming languages, and suggests future research directions in prompt engineering, prompting techniques and language expansion to strengthen AI-driven software production.

