Audit Reveals LLMs Spit Out Malicious Code
Pentesting
This study runs a forensic-style audit on four production language models and finds a worrying pattern: models can and do generate code with scam links, and they sometimes do it in response to perfectly normal developer prompts. Across the experiments the average rate of programs containing malicious URLs is roughly 4.2 percent, and 177 seemingly harmless prompts trigger malicious outputs on every model tested. Think of your code generator as a helpful intern that occasionally brings back a phishing flyer.
How they test this is straightforward and mechanical. Researchers turn known scam-domain lists into plain developer-style prompts, ask models to write code, automatically scan the outputs for known-bad URLs, and then double-check hits by hand. The approach scales, and the same pattern shows up in every model tested, so this is not an isolated fluke of one model or one dataset. It also survives different sampling settings, so the usual randomness tricks do not explain it away.
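To make that loop concrete, here is a minimal sketch under stated assumptions: generate_code(prompt) is a hypothetical stand-in for whatever LLM client is being audited, and scam_domains.txt is assumed to hold one known-bad domain per line. This is not the paper's released tooling.

```python
# Minimal audit-loop sketch: developer-style prompt in, generated code out,
# then flag any URL whose domain appears on a known scam list.
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>)]+")

def load_scam_domains(path: str) -> set:
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def audit_prompt(prompt: str, generate_code, scam_domains: set) -> list:
    """Generate code for one prompt and return any URLs whose domain is known-bad."""
    code = generate_code(prompt)          # query the model under audit (placeholder client)
    hits = []
    for url in URL_RE.findall(code):      # extract candidate URLs from the output
        host = urlparse(url).hostname
        if host and host.lower() in scam_domains:
            hits.append(url)              # hits would still go to manual verification
    return hits
```

In practice anything flagged here gets re-checked by hand, mirroring the paper's oracle-plus-human verification step.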
Why this matters is practical: code-generation workflows can silently introduce harmful URLs into projects, opening the door to malware, phishing, and supply-chain compromise. The discovery suggests training-data poisoning has made its way into production systems at a scale where it is exploitable.
Operational takeaways
- Run post-generation scans for URLs and flag unknown domains before merging code (a minimal gate sketch follows this list).
- Sandbox any generated code and avoid automatic execution.
- Apply URL reputation checks and multi-detector ensembles on outputs.
- Audit training data provenance and push for poisoning defenses.
- Keep human review in the loop for critical or external-facing code.
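A minimal sketch of that first takeaway, assuming a hypothetical per-project allowed_domains.txt allowlist; failing the check on any unknown domain is an illustrative policy, not something the paper prescribes.

```python
# Pre-merge gate sketch: extract URLs from generated code and fail the check
# when any domain is not on the project allowlist. allowed_domains.txt is a
# hypothetical project file with one trusted domain per line.
import re
import sys
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>)]+")

def unknown_domains(code: str, allowlist: set) -> set:
    """Return domains referenced in the code that are not on the allowlist."""
    found = set()
    for url in URL_RE.findall(code):
        host = urlparse(url).hostname
        if host and host.lower() not in allowlist:
            found.add(host.lower())
    return found

if __name__ == "__main__":
    with open("allowed_domains.txt", encoding="utf-8") as f:
        allowlist = {line.strip().lower() for line in f if line.strip()}
    suspects = unknown_domains(sys.stdin.read(), allowlist)   # pipe the generated file in
    if suspects:
        print("Unknown domains, review before merge:", ", ".join(sorted(suspects)))
        sys.exit(1)   # block the merge until a human has looked
```

Unknown does not mean malicious; the point is simply that no new domain lands in the codebase without a human noticing.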
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Poisoned at Scale: A Scalable Audit Uncovers Hidden Scam Endpoints in Production LLMs
🔍 ShortSpan Analysis of the Paper
Problem
Large Language Models have become critical to modern software development, but their training on internet data risks absorbing and reproducing malicious content. This study asks whether production LLMs generate malicious code, including code containing harmful URLs. The question matters for security evaluators because it exposes scalable hidden risk surfaces and data-poisoning vulnerabilities in real-world AI systems. The authors show that production LLMs can emit malicious code even when given seemingly innocuous prompts, underscoring the need for stronger data provenance, post-generation safety checks and robust defence mechanisms to prevent the propagation of hidden security threats.
Approach
The authors present a scalable automated audit framework that uses known scam URL databases to synthesize innocuous developer prompts and query production LLMs to determine whether the generated code contains malicious URLs. The evaluation covers four production LLMs: GPT-4o, GPT-4o mini, Llama 4 Scout and DeepSeek V3. The framework comprises four stages: (1) malicious URL collection from two scam databases, eth-phishing-detect and PhishFort; (2) prompt synthesis by a Prompt LLM to produce concise prompts that read as benign coding requests yet are likely to elicit malicious code; (3) code generation and URL extraction by a Codegen LLM; and (4) oracle- and human-based verification to confirm maliciousness and assemble a dataset of innocuous-prompt and malicious-code pairs. An ensemble of detectors (ChainPatrol, Google Safe Browsing and SecLookup) flags malicious URLs, and a human adjudication step validates prompts. The resulting set S contains innocuous prompts that elicit malicious code, serving as a benchmark for auditing LLMs and measuring real-world risk. The study also varies sampling settings and model pairings to demonstrate cross-model vulnerability, and publicly releases the prompts and evaluation results as benchmarks.
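To illustrate the oracle stage, here is a minimal sketch under stated assumptions: the detectors are plain callables standing in for services such as ChainPatrol, Google Safe Browsing and SecLookup (their real client APIs are not reproduced here), and the single-vote threshold is an illustrative choice rather than the authors' exact rule.

```python
# Ensemble-oracle sketch: a URL is flagged when at least `min_votes` of the
# configured reputation detectors report it as malicious. Detector callables
# are placeholders; real services need their own clients and API keys.
from typing import Callable, Iterable

Detector = Callable[[str], bool]      # True means the detector flags the URL

def flag_url(url: str, detectors: Iterable[Detector], min_votes: int = 1) -> bool:
    votes = sum(1 for detect in detectors if detect(url))
    return votes >= min_votes         # flagged URLs still go to human adjudication

# Toy usage with a local blocklist standing in for a real reputation service:
blocklist = {"scam.example"}

def local_detector(url: str) -> bool:
    return any(domain in url for domain in blocklist)

print(flag_url("https://scam.example/login", [local_detector]))   # True
```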
Key Findings
- Across the four production LLMs, the average rate of code containing malicious URLs is 4.24 per cent, with the rate varying by prompt-generator and code-generator pairing.
- The pairing of GPT-4o mini as the prompt generator and GPT-4o as the code generator produced the highest observed rate of malicious programs at 5.94 per cent, while Llama 4 Scout prompting with DeepSeek V3 code generation yielded 3.19 per cent.
- When analysing only the generated URLs, 12 per cent are malicious on average, peaking at 17.60 per cent for the GPT-4o mini prompt and GPT-4o code generation combination. The results include 1,877 unique malicious domains identified by that pairing.
- Across model combinations, 2,029 malicious domains are identified by all four models, indicating substantial domain-level convergence across diverse training data sources and highlighting that malicious domains can be widely encountered regardless of the organisation behind the model.
- Individual models each generate a substantial number of unique malicious URLs while overlapping heavily at the domain level; the all-model intersection accounts for about 60 per cent of each model's domains on average (a toy sketch after this list illustrates the overlap computation), indicating broad exposure to common malicious content across models.
- 177 innocuous prompts consistently trigger malicious outputs across all four models; they are derived from 191 candidate prompts after filtering out those that merely duplicated seed domains. Manual validation confirms that 93.2 per cent of the 177 prompts are legitimate, innocuous developer requests.
- When testing with a higher sampling temperature (0.8), malicious program rates remain substantial, ranging from 4.19 to 5.09 per cent across model pairs, demonstrating that the vulnerability is robust to the sampling strategy.
- The study publicly releases its prompts and evaluation results to support further research and the development of safer code-generation workflows.
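The domain-level convergence above boils down to a set intersection. A toy sketch, with invented per-model sets rather than the paper's data:

```python
# Domain-overlap sketch: intersect the malicious domains each model emitted and
# report what share of each model's domains is common to all of them.
per_model_domains = {
    "model_a": {"scam1.example", "scam2.example", "scam3.example"},
    "model_b": {"scam1.example", "scam2.example", "scam4.example"},
    "model_c": {"scam1.example", "scam2.example", "scam5.example"},
}

shared = set.intersection(*per_model_domains.values())
for name, domains in per_model_domains.items():
    share = len(shared) / len(domains)
    print(f"{name}: {share:.0%} of its malicious domains appear in every model")
```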
Limitations
Limitations include constrained oracle coverage, as not all malicious links appear in the existing scam databases; detection depends on a multi-detector ensemble with varying coverage and agreement, which introduces inconsistencies; the analysis focuses on malicious code generated by LLMs alone, without involving external tools such as search engines or plugins; the framework treats malicious URLs as the primary malicious signal and may underestimate other forms of malicious code; and some prompts may inadvertently contain seed URL fragments, requiring filtering to distinguish model-driven behaviour from simple instruction following.
Why It Matters
The findings demonstrate a scalable method to audit AI agents for hidden risk surfaces and data-poisoning vulnerabilities in production pipelines, emphasising that code-generation workflows can be coerced into unsafe outputs even from benign prompts. Practical mitigations include stronger data provenance and poisoning defences, post-generation safety checks, input and output filtering, URL and code sanitisation, sandboxed execution, and model auditing to catch unsafe outputs before deployment. The work also highlights broad societal security implications, such as malware distribution and supply-chain compromise via AI-enabled software development, reinforcing the need for robust safeguards as AI systems scale into critical applications.
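As a closing illustration of the sandboxing advice, here is a rough sketch that runs generated code in a separate isolated-mode interpreter with a hard timeout. This is deliberately only a partial sandbox: real isolation also needs network and filesystem controls such as containers, seccomp profiles or disposable VMs.

```python
# Partial sandbox sketch: never exec() generated code in-process; run it in a
# fresh interpreter with -I (isolated mode) and a hard timeout. This bounds
# runtime only; add container or VM isolation before trusting it further.
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # Raises subprocess.TimeoutExpired if the code runs longer than timeout_s.
    return subprocess.run([sys.executable, "-I", path],
                          capture_output=True, text=True, timeout=timeout_s)

# Usage: result = run_untrusted(generated_code); inspect result.returncode and result.stdout.
```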