Formal checks find exploitable flaws in LLM code
Defenses
We keep wiring Large Language Models (LLMs) into build systems as if they’re junior developers who never sleep. This study stops the vibes and runs the maths. Across 3,500 generated code artifacts covering five Common Weakness Enumeration (CWE) areas, 55.8% contain vulnerabilities. Not vague “might be risky” flags either: the pipeline feeds candidate sites into the Z3 solver and extracts satisfiability witnesses. That means concrete inputs that make the bug bite.
The spread across models is ugly. The best showing still lands a grade D at 48.4% vulnerable outputs. The worst hits 62.4%. Integer arithmetic prompts are a bloodbath at 87%, followed by memory allocation at 67%. The pattern is familiar: overflow in allocation math, signed and unsigned conversions that wrap or widen the wrong way. The authors validated a sample under GCC AddressSanitizer; six of seven Z3‑proven cases crashed in practice. They also show input‑handling failures that yield SQL injection and authentication or crypto missteps that allow weak password recovery via precomputed lookup.
What the proofs give the attacker
A solver witness is the difference between a hunch and a working exploit. If an allocator size calculation overflows to a small buffer, the witness tells you the exact length values to force a heap write out of bounds. Signedness slip? You get the input that flips a check from safe to unsafe. For SQL injection, you get parameters that exfiltrate full records. No fuzzing lottery, no guesswork at 3am.
Why our controls fail
Stacking common static tools on the pipeline does not help much here. Six popular tools combined flagged 7.6% of artifacts and missed 97.8% of the Z3‑proven findings. Notably, heavyweight C analysis and CodeQL did not catch the formally proven integer‑overflow allocation defects. “Be secure” prompting barely moves the needle, shaving only four points off the average vulnerability rate. So the two crutches teams lean on today — stricter prompts and off‑the‑shelf scanners — leave a wide attack surface untouched.
The oddest bit is the generation–review split. When asked to review their own vulnerable outputs, models identified 78.7% of the Z3‑proven defects. Yet in generation mode they still emit them at high rates. The knowledge is in there; the decoding policy is not using it when it counts.
Scope caveats apply: five CWE classes, 500 prompts, temperature zero, and validation on selected runtime harnesses. But the direction of travel is clear. If your pipeline accepts LLM code by default, and your guardrails are conventional linters or a “security mode” prompt, you are shipping proofs of exploit, not just style nits. The open question is how to make generation apply what review already knows — or failing that, how to put a prover between the model and your main branch.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code
🔍 ShortSpan Analysis of the Paper
Problem
This paper measures how often state-of-the-art large language models produce exploitable security vulnerabilities when asked to generate production code for security‑sensitive tasks. The question matters because AI coding assistants are widely used in web backends, embedded systems and cryptographic code, yet previous analyses relied on pattern matching or manual inspection and did not establish whether flagged defects are provably exploitable.
Approach
The authors generated 3,500 code artifacts by prompting seven frontier LLMs on a benchmark of 500 natural‑language prompts (100 each for five CWE categories: memory allocation and indexing, integer arithmetic, authentication, cryptography, and input handling). Models were queried deterministically (temperature 0) and instructed to produce runnable code. Each artifact was analysed with the COBALT pipeline, which identifies candidate sites, encodes vulnerability conditions as Z3 SMT formulas, and extracts concrete satisfiability witnesses when available. A subset of findings was validated with runtime proof‑of‑concept harnesses compiled under GCC AddressSanitizer. Three auxiliary experiments used a secure prompt ablation, a comparison against six industry static analysis tools, and a model self‑review test.
Key Findings
- Across all models the mean vulnerability rate was 55.8% of generated artifacts; 1,055 findings were formally proven exploitable by Z3 satisfiability witnesses.
- The best model still performed poorly: Gemini 2.5 Flash had a 48.4% vulnerability rate (grade D); GPT‑4o performed worst at 62.4% (grade F). No model achieved grade C or better.
- Vulnerability rates varied by category: integer arithmetic prompts produced the highest failure rate at 87%, followed by memory allocation at 67%—issues were dominated by integer overflow in allocation arithmetic and signed/unsigned conversion errors.
- Runtime confirmation: 6 of 7 representative Z3‑proven vulnerabilities produced real memory‑corruption crashes under AddressSanitizer; examples included SQL injection extracting full records and weak password hashing enabling recovery of a short password using a precomputed lookup.
- Security prompting had minimal effect: explicit security instructions reduced the mean vulnerability rate by only four percentage points in a 50‑prompt ablation, leaving most models at failing grades.
- Industry tooling missed almost all formally proven issues: six combined tools flagged 7.6% of artifacts and failed to detect 97.8% of Z3‑proven findings; CodeQL and heavyweight C analyzers detected none of the formally proven integer‑overflow allocation defects.
- Generation–review asymmetry: when presented with their own vulnerable outputs, models identified 78.7% of Z3‑proven defects in review mode but still generate them by default at high rates, indicating knowledge is present but not applied during generation.
Limitations
The benchmark covers a bounded set of 500 prompts and five CWE classes and was run at temperature 0; results reflect model behaviour at the time of study and should not be naively generalised to all tasks, temperatures, multi‑turn interactions or other programming languages. Full ablation across the full corpus and multi‑language targets are left for future work.
Implications
Offensive security implications are clear: concrete witness values produced by Z3 give an attacker inputs that reliably trigger exploitable conditions such as integer overflows leading to heap corruption, SQL injection yielding data exfiltration, or weak hashing enabling rapid credential recovery. Because common static tools and query engines miss the vast majority of Z3‑proven defects, automated pipelines that accept model outputs without formal verification or manual audit are vulnerable to supply of exploitable code. The generation–review asymmetry means an attacker or malicious actor could reuse model outputs or the witnesses to craft reliable exploits even where the model could have flagged the issue in review mode.