
LLMs Fail to Fix Real Exploitable Bugs

Pentesting
Published: Thu, Sep 04, 2025 • By Rowan Vale
New exploit-driven testing finds that popular large language models fail to reliably repair real, exploitable Python vulnerabilities. Researchers test 23 real CVEs with working proof-of-concept exploits and show that even the top model fixes only 5 of them. The result warns that AI patches often leave attack surfaces open and need exploit-aware checks before deployment.

A new paper puts AI patching to a harsh, realistic test and the results are blunt: even the best model fixes only 5 out of 23 real, PoC-verified Python vulnerabilities. That is about 22 percent success when success means the original exploit actually stops running.

A large language model is an AI system that predicts text from patterns learned over vast amounts of code and prose; from here on I use the abbreviation LLM. A PoC is a proof-of-concept exploit: code that demonstrates a vulnerability in practice.

Why this matters: many prior evaluations let patched code pass unit tests and call it a day. Those tests miss whether an exploit still works. VulnRepairEval runs the exploit in a container against the patched code and counts a fix only if the exploit fails. The headline finding is simple and worrying: LLMs often mislocate the bug or produce syntactically or semantically broken patches, and fancy prompting or multi-agent tricks barely move the needle.
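To make that check concrete, here is a minimal sketch of an exploit-driven success test in Python. It is not the paper's code: the exit-code convention and the function names are assumptions for illustration, and real PoCs signal success in many different ways.

```python
import subprocess

def poc_succeeds(poc_cmd: list[str], timeout: int = 60) -> bool:
    """Run a proof-of-concept exploit and report whether it triggered the bug.

    Assumed convention: the PoC exits 0 when the exploit works. Real PoCs
    signal success differently, so this check needs adapting per CVE.
    """
    try:
        result = subprocess.run(poc_cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def patch_accepted(poc_against_original: list[str], poc_against_patched: list[str]) -> bool:
    """Exploit-driven criterion: the PoC must work against the original code
    and fail against the patched code. Passing unit tests alone proves neither."""
    return poc_succeeds(poc_against_original) and not poc_succeeds(poc_against_patched)
```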

Practical straight-line fixes for defenders

  • Minimal viable controls - never auto-deploy AI patches; require PoC failure and a human security review.
  • Good - add automated exploit testing in CI, run static analysis and fuzzing after patch generation.
  • Better - require differential containerized runs against the original PoC and add lightweight regression tests for adjacent modules.
  • Best - integrate an exploit-aware pipeline like VulnRepairEval, mandate human signoff, and track patch provenance in the supply chain.

Checklist before trusting an AI patch: verify the PoC fails against the patched code, run lint and unit tests, perform a quick dependency impact check, and manually review any nontrivial change. AI can accelerate finding candidate fixes, but right now it is not a drop-in safety net for security-critical remediation.
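As a rough illustration of how the mechanical parts of that checklist could become a pre-merge gate, the sketch below chains them together. The PoC script, lint and test commands are placeholders to adapt to your project; the dependency impact check and security review remain human tasks.

```python
#!/usr/bin/env python3
"""Illustrative pre-merge gate for AI-generated patches (not a drop-in tool)."""
import subprocess
import sys

# Placeholder commands: swap in your project's PoC script, linter and test runner.
CHECKS = [
    ("poc-must-fail", ["python", "poc_exploit.py"]),  # hypothetical PoC script
    ("lint",          ["ruff", "check", "."]),
    ("unit-tests",    ["pytest", "-q"]),
]

def run_check(name: str, cmd: list[str]) -> bool:
    proc = subprocess.run(cmd, capture_output=True)
    # For the PoC we want failure (exploit blocked); everything else must pass.
    ok = proc.returncode != 0 if name == "poc-must-fail" else proc.returncode == 0
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
    return ok

if __name__ == "__main__":
    results = [run_check(name, cmd) for name, cmd in CHECKS]
    # Dependency impact analysis and manual security review stay with humans;
    # this gate only blocks the obvious failures before a reviewer looks.
    sys.exit(0 if all(results) else 1)
```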

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities

Authors: Weizhe Wang, Wei Ma, Qiang Hu, Yao Zhang, Jianfei Sun, Bin Wu, Yang Liu, Guangquan Xu, and Lingxiao Jiang
The adoption of Large Language Models (LLMs) for automated software vulnerability patching has shown promising outcomes on carefully curated evaluation sets. Nevertheless, existing datasets predominantly rely on superficial validation methods rather than exploit-based verification, leading to overestimated performance in security-sensitive applications. This paper introduces VulnRepairEval, an evaluation framework anchored in functional Proof-of-Concept (PoC) exploits. Our framework delivers a comprehensive, containerized evaluation pipeline that enables reproducible differential assessment, where repair success requires the original exploit to fail execution against the modified code. The benchmark construction involved extensive data curation: we processed over 400 CVEs and approximately 2,500 potential sources to extract a collection of authentic vulnerability instances (23 Python CVEs) amenable to automated testing with working PoCs. Through VulnRepairEval, we conduct a comprehensive evaluation of 12 popular LLMs and observe a significant performance deficit: even the top-performing model successfully addresses merely 5/23 instances (about 21.7%), exposing critical weaknesses in security-focused applications. Our failure analysis reveals that most unsuccessful attempts stem from imprecise vulnerability identification and patches containing syntactic or semantic errors. Enhanced prompting strategies and multi-agent approaches yield minimal improvements, with overall effectiveness remaining largely unaffected. This work contributes a stringent, practical evaluation framework for LLM-driven vulnerability remediation and underscores the necessity for assessment protocols that authentically reflect real-world exploitation scenarios.

🔍 ShortSpan Analysis of the Paper

Problem

The paper investigates how effectively large language models can repair real software vulnerabilities, focusing on whether AI patches truly block exploits rather than merely keeping code running. It notes that existing datasets rely on superficial validation such as unit tests, which can create a false sense of safety in security-sensitive contexts. The work argues for exploit-based evaluation, because patches that pass basic tests may still leave the vulnerability exploitable. The Python ecosystem is highlighted due to its prevalence in AI, cloud and data workloads, and the risk that patches validated only by tests do not eliminate attack surfaces.

Approach

The authors introduce VulnRepairEval, a PoC-driven benchmark with a containerised end-to-end evaluation pipeline. A patch is deemed successful only if the original proof-of-concept (PoC) exploit fails when run against the patched code. The benchmark integrates patch generation, PoC execution and dual-container differential validation. It covers 23 real Python CVEs from 2017 to 2024 across nine vulnerability classes, with patch complexity ranging from single-line edits to multi-file refactoring.

Dataset construction involved processing more than 400 CVEs and about 2,500 potential sources to identify authentic vulnerability instances amenable to automated testing with working PoCs, resulting, after filtering, in 448 CVEs across 164 projects and a final benchmark of 23 CVEs across nine classes. Public PoCs were gathered by crawling NVD references, using an automated multi-stage pipeline to classify PoC content as executable, descriptive or brief, and to verify PoCs by constructing exploits and validating their effectiveness.

The evaluation runs patch generation on 12 mainstream LLMs under uniform prompts and runtimes. The pipeline uses Docker to create isolated vulnerable and patched environments and compares PoC outcomes to produce a PoC-validated repair measure. Patch generation is constrained to the vulnerable files rather than entire repositories, to manage context size and avoid over-focusing on a single region. The evaluation metric set includes a primary PoC-validated repair success rate, secondary diagnostics for patch format and abstention, and a composite ranking score for leaderboard use.
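A minimal sketch of what such a dual-container differential check could look like, assuming pre-built vulnerable and patched Docker images and a PoC that exits 0 on successful exploitation; the image names and commands are illustrative and not taken from the paper's implementation.

```python
import subprocess

def poc_exploits(image: str, poc_cmd: str, timeout: int = 120) -> bool:
    """Run the PoC inside an isolated container; assumed convention is exit
    code 0 when exploitation succeeds."""
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm", image, "sh", "-c", poc_cmd],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0

def differential_validate(vulnerable_image: str, patched_image: str, poc_cmd: str) -> bool:
    """Differential criterion in the spirit of VulnRepairEval: the exploit
    must succeed against the vulnerable build and fail against the patched one."""
    return poc_exploits(vulnerable_image, poc_cmd) and not poc_exploits(patched_image, poc_cmd)

# Hypothetical usage with made-up image names:
# differential_validate("cve-demo:vulnerable", "cve-demo:patched", "python /poc/exploit.py")
```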

Key Findings

  • End-to-end repair is currently limited: the strongest model repairs 5 of 23 CVEs (about 21.7 per cent), with average end-to-end repair around 7 per cent across models.
  • Most failures stem from vulnerability localisation errors and from patches that are syntactically or semantically incorrect, with patch format conformance also a frequent bottleneck.
  • Prompting aids are marginal and inconsistent; lightweight hints have limited impact, while multi-round agentic workflows reduce abstention but do not close the end-to-end gap.
  • Agent-based approaches improve detection rates and yield more applicable patches in some cases, but multi-file regressions remain a key constraint; patches are often confined to single files and fail to resolve multi-file exploits.
  • Model family differences are evident; larger thinking models show higher repair rates than non-thinking variants, and there is a clear advantage for models that externalise reasoning, although none achieve robust performance.
  • Vulnerability difficulty matters: Easy vulnerabilities are more susceptible to repair, while Hard vulnerabilities show near universal failure; only rare Hard cases are fixed by the strongest models.
  • Conservative model behaviour leads to abstention in many cases, creating a false sense of security when no patch is proposed.
  • Threats to validity include the single-file patch constraint, restricted context windows, and the PoC failure criterion, which confirms that the specific exploit is blocked rather than that the vulnerability is fully eliminated; cross-language generalisation and the need for human review are emphasised.

Limitations

The framework currently operates on constrained single-file patches with automated patch generation and does not yet address multi-file dependencies, build systems or configuration interactions common in real projects. The PoC-based success criterion, while rigorous, can be satisfied by brittle patches that merely defeat the specific PoC, and results may shift if PoCs vary. The study focuses on Python and a curated CVE set; cross-language generalisation remains to be demonstrated, and human review is still recommended to complement AI-based patching.

Why It Matters

VulnRepairEval provides a stringent, reproducible framework that aligns evaluation with real-world security practice by requiring PoC failure to validate patches. The results reveal substantial gaps in current LLM capabilities for security-critical repairs and underscore the need for exploit-aware evaluation pipelines, human oversight and complementary checks such as static analysis and fuzzing before deployment. The work informs defender testing strategies and supports safer deployment of automated patching, while highlighting the societal and software supply chain risks of scaling AI-driven remediation. The authors suggest that future research should extend exploitation-aware evaluation to additional languages and integrate human review to achieve robust real-world applicability.

