New to ShortSpan? We distil the AI-security research that matters into practitioner takeaways — edited by Ben Williams (NCC Group). Get the weekly email

// Analysis

Adversarial context steers LLM code toward vulnerabilities

Published: Wed, Jun 10, 2026 • By Rowan Vale

Defenses

Adversarial context steers LLM code toward vulnerabilities

New research shows context-based prompt attacks push Large Language Models to generate insecure code. Across 2,800 trials, vulnerability rates jump from 3.5% to 37.4%. Direct instructions break GPT-3.5-Turbo 100% of the time, and attacks transfer across models. A dual-layer detector catches 89.1% with 0.3% false positives at 520 ms.

AI code assistants don’t need compromised training data to go wrong. This study shows you can poison their immediate context with the right comments, variable names, or example snippets and they will happily write vulnerable code. Think docs, templates, or internal wikis that developers copy into the editor; the Large Language Model (LLM) reads that context and drifts toward unsafe patterns.

How the attack lands

The team ran 2,800 controlled trials across CodeT5+, CodeLlama-7B, GPT-3.5-Turbo and GPT-4, targeting five CWE classes: SQL injection, cross-site scripting, hardcoded credentials, path traversal and insecure crypto. They tried four prompt conditions: baseline, direct instruction, semantic priming and example-based vulnerable snippets. Output auditing used three static analysers, AST/regex checks and a manual review of 15% of samples.

The numbers bite. Adversarial context lifts the mean vulnerability generation rate from 3.5% to 37.4% — a 10.7× jump. Direct instructions are the blunt instrument and work best overall, averaging 55% attack success; GPT-3.5-Turbo hits 100% under direct instructions. Example-based cues still land at 31.4%. Semantic priming is weaker at 17.5% but varies a lot by model.

Placement matters. Context dropped 10–50 tokens before the target function is most potent, hitting 62.1% success, which the authors attribute to recency in attention. SQL injection shows up most often at about 36.6%. Certain phrasings help: authoritative imperatives and “legacy-justification” language score 58.3% and 51.7% respectively. In plain terms, tell the model to “do it this insecure way because performance” and it often will.

Transferability is the worrying bit. Prompts crafted for one model often work on others: open-source to open-source transfers run at 95–100% on average, open-source to commercial average 65–82%, and GPT-3.5 to GPT-4 is around 90%. One tainted README or snippet can trip multiple code assistants across an organisation.

Defence that actually runs in an IDE

They propose a dual-layer defence: scan the prompt context for risky patterns, then scan the generated code for vulnerable constructs. On their held-out set it detects 89.1% of attacks with a 0.3% false positive rate and 520 ms mean latency, which is fast enough to sit inline in a developer workflow.

There are caveats. Tests covered Python and JavaScript only, with a 512-token context window, and smaller samples for commercial models. The defence is point-in-time; adaptive adversaries may find gaps. But the core result stands: context is an attack surface, and models across families exhibit the same pull toward unsafe outputs when you shape that context just so.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Context-Based Adversarial Attacks on AI Code Generators: Vulnerability Analysis and Implications

Authors: Walther A. Del Orbe, John D. Hastings, and Varghese Vaidyan

AI-powered code generation systems have transformed software development but introduce critical inference-time security vulnerabilities. This research presents a systematic investigation of context-based adversarial attacks, where strategically crafted contextual inputs, including comments, documentation, variable names, bias large language models toward generating exploitable code. Through 2,800 controlled experiments across CodeT5+, CodeLlama, GPT-3.5-Turbo, and GPT-4, we quantify attack effectiveness and defense mechanisms. Results demonstrate that adversarial conditions increase vulnerability generation 10.7x (from 3.5% to 37.4%), with direct instruction attacks achieving 100% success on GPT-3.5-Turbo. Cross-model transferability reaches 60-100%, indicating systemic architectural vulnerabilities rather than model-specific flaws. Our dual-layer defense framework achieves 89.1% detection rate with 0.3% false positives and 520ms latency, demonstrating practical feasibility for real-time deployment in development environments.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies inference-time context-based adversarial attacks against AI code generators and why they matter for software security. Rather than manipulating training data, an attacker subtly alters context supplied to a code generator—comments, documentation, variable names or example snippets—to bias the model into producing insecure or exploitable code. Given widespread adoption of AI-assisted coding, such contextual manipulations could act as a practical supply-chain vector that propagates vulnerabilities into production systems with little attacker effort.

Approach

The authors ran 2,800 controlled trials across four models: CodeT5+, CodeLlama-7B, GPT-3.5-Turbo and GPT-4, using default inference settings. They tested five vulnerability classes mapped to CWEs: SQL injection, cross-site scripting, hardcoded credentials, path traversal and insecure cryptography. Four prompt conditions were used: baseline, direct instruction, semantic priming and example-based vulnerable snippets. Outputs were analysed by a three-stage pipeline combining three static analysers, AST/regex pattern checks and manual review of a stratified 15% sample. Cross-model transferability was measured by verbatim re‑use of adversarial prompts on other models. Statistical tests and bootstrap confidence intervals were pre-specified; experimental limits included API cost constraints for commercial models and a 512-token context window.

Key Findings

Adversarial context raised mean vulnerability generation rate (VGR) from 3.5% to 37.4%, a 10.7× increase (statistically significant).
Direct instruction attacks had the highest attack success rate (ASR) overall (mean 55%); GPT-3.5-Turbo reached 100% ASR under direct instructions.
Example-based attacks achieved 31.4% ASR; semantic priming was least effective (17.5% ASR) but showed high model variance.
Context placed in the pre-function zone (10–50 tokens before the target) produced the highest ASR at 62.1%, attributed to recency bias in attention.
SQL injection was the most common adversarial VGR (approx. 36.6%); differences across vulnerability categories were significant.
High-risk linguistic patterns such as authoritative imperatives and legacy-justification phrasing scored high ASR (58.3% and 51.7% respectively).
Cross-model transferability was high: open-source to open-source transfers were 95–100% (mean TR≈0.975); open-source to commercial averaged 65–82% (mean TR≈0.738); GPT-3.5 to GPT-4 transfer was ≈90%.
The proposed dual-layer defence (prompt-level and code-level analysis) detected 89.1% of attacks with a 0.3% false positive rate and mean end-to-end latency of 520 ms, suitable for real-time integration.

Limitations

Experiments covered only Python and JavaScript and models available as of December 2024; results may not generalise to other languages or future model releases. Context windows were limited to 512 tokens so positional effects could differ in longer contexts. Commercial model samples were smaller due to API cost. The defence was evaluated on held-out data and may be evaded by adaptive adversaries; longitudinal evaluation is needed.

Implications

Offensively, an attacker can embed seemingly legitimate but malicious phrases or vulnerable example snippets into third-party documentation or code examples to steer many code generators toward insecure implementations. Placement matters: placing cues immediately before the target function and using authoritative or legacy-justification language substantially increases success. High cross-model transferability implies a single compromised documentation artefact could cause vulnerabilities across multiple code-generation systems. Example-based and direct-instruction techniques enable straightforward reuse by adversaries who lack access to model internals, making inference-time supply‑chain manipulation a practical threat at scale.

Links Original paper on arXiv

Adversarial context steers LLM code toward vulnerabilities

How the attack lands

Defence that actually runs in an IDE

📋 Original Paper Title and Abstract

Context-Based Adversarial Attacks on AI Code Generators: Vulnerability Analysis and Implications

🔍 ShortSpan Analysis of the Paper

Problem

Approach

Key Findings

Limitations

Implications

Related Articles

Agents Leak Secrets via Web Search Tools

Red team shows LLM agents hide injected actions

Formal checks find exploitable flaws in LLM code

Related Research

Get the weekly digest