
LLM agents struggle to reproduce web vulnerabilities

Agents
Published: Fri, Oct 17, 2025 • By Rowan Vale
A first large study tests 20 LLM agents on automating web vulnerability reproduction and finds limited end-to-end success. Agents turn reports into proof of concept code for simple library flaws but fail on complex, multi-component services and authentication hurdles. Defenders should prioritise environment simulation, authentication controls and monitoring of agent activity.

A Large Language Model (LLM) is an AI model that predicts and generates text; such models underpin the agents evaluated in this study. The paper runs the first broad, empirical test of 20 LLM-driven agents on turning vulnerability reports into working exploits, and it asks a practical question: can we hand off reproduction to automation?

The short answer is no, not reliably. The researchers first test three representative vulnerabilities, then expand to a benchmark of 80 real-world CVEs covering seven vulnerability types and six web technologies. They pair three top agents, OpenHands, SWE-agent and CAI, with three foundation models: GPT-4.1, Claude Sonnet 4 and Gemini 2.5 Pro. Constraints mirror a realistic lab: Dockerised environments, no browser automation, 60-minute timeouts and limited oracle budgets per stage.

Results are blunt. End-to-end reproduction success sits below 25 per cent overall. The best configuration achieves 22.5 per cent end-to-end success (OpenHands with Claude Sonnet 4). OpenHands with GPT-4.1 reaches 20 per cent and OpenHands with Gemini 2.5 Pro reaches 13.8 per cent. Environment setup is uneven: CAI shows the most consistent setup success, OpenHands is thorough but costly in time and tokens, and SWE-agent trails.

The study separates stages for clarity. Agents often generate plausible proof of concept (PoC) code; OpenHands plus Claude Sonnet 4 runs PoC code successfully 70.0 per cent of the time. Where the chain breaks is triggering the actual vulnerability in a running service. Trigger rates are low, between 8.8 and 21.3 per cent, and fall sharply when authentication context is missing. When human-provided authentication tokens are available, some configurations hit 67 per cent; autonomous login and incomplete credentials degrade performance by roughly a third or more.
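The execute-versus-trigger gap is why the benchmark scores these stages separately with automated oracles. The sketch below illustrates the distinction; the oracle logic, field names and canary marker are illustrative assumptions, not the paper's implementation:

```typescript
// Hypothetical two-stage check: stage 1 asks "did the PoC run at all?",
// stage 2 asks "did it observably affect the target service?".
interface PocResult {
  exitCode: number;        // exit status of the agent-generated PoC
  serviceResponse: string; // captured response from the target service
}

function pocExecuted(r: PocResult): boolean {
  // A PoC that merely runs to completion passes this check...
  return r.exitCode === 0;
}

function vulnerabilityTriggered(r: PocResult): boolean {
  // ...but only an observable effect counts as a trigger.
  // "PWNED-MARKER" is an illustrative canary string, not from the paper.
  return r.serviceResponse.includes("PWNED-MARKER");
}

// A PoC that runs cleanly yet bounces off an auth wall: executed, not triggered.
const result: PocResult = { exitCode: 0, serviceResponse: "403 Forbidden" };
console.log(pocExecuted(result), vulnerabilityTriggered(result)); // true false
```

Counting only the first check would report 70 per cent "success" for runs where the second check shows the vulnerability was never actually reached.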

Not all flaws are equal. Library or single-package issues such as prototype pollution see high reproduction rates (63 per cent in the benchmark), while service-based flaws including SSRF, RCE and SQL injection are much harder; SQL injection reproductions register at zero in this dataset. Web stack matters too: PHP scenarios fare better than TypeScript in these tests.
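It helps to see why library flaws are the easy case: prototype pollution, the best-reproduced class in the benchmark, needs no running service at all. The vulnerable recursive merge below is a generic sketch of the pattern, not code from the paper's dataset:

```typescript
// A naive deep merge, typical of vulnerable utility libraries.
function merge(target: any, source: any): any {
  for (const key of Object.keys(source)) {
    if (typeof source[key] === "object" && source[key] !== null) {
      target[key] = merge(target[key] ?? {}, source[key]);
    } else {
      target[key] = source[key];
    }
  }
  return target;
}

// JSON.parse creates "__proto__" as an ordinary own key, so the merge
// walks into Object.prototype and writes attacker-controlled data there.
const payload = JSON.parse('{"__proto__": {"isAdmin": true}}');
merge({}, payload);

// Every plain object in the process now inherits the polluted property.
const probe: any = {};
console.log(probe.isAdmin); // true
```

Reproducing this requires only a package install and a script run, which is exactly the kind of self-contained task the agents handle well; an SSRF or SQL injection instead demands a configured, authenticated, multi-component service before any exploit can even be attempted.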

What defenders should do now

The paper points to three defence priorities: accurate environment simulation, because threat validation requires staging environments that closely match production; robust authentication handling that does not leak tokens into reports; and strict guardrails on any tool that can generate exploit code. The authors also flag governance and safety controls to limit misuse as automation improves.

Practical, tiered controls are straightforward. Good: enforce human review of any PoC produced by an agent and restrict agent access to production credentials. Better: require reproducibility checks in isolated staging containers that mirror production services and log agent actions. Best: combine strict tool-level sandboxing, mandatory attestation for any reproduced exploit, and continuous monitoring of agent inputs and outputs.
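The "log agent actions" and "do not leak tokens" controls can be combined in one wrapper around every tool call. A minimal sketch, assuming a generic tool-call shape; the tool name, argument keys and redaction rule are illustrative, not from the paper:

```typescript
// Audit-log wrapper for agent tool calls: every call is recorded,
// with credential-like arguments redacted before they reach the log.
type ToolCall = { tool: string; args: Record<string, string> };

const auditLog: string[] = [];

function runWithAudit(call: ToolCall, exec: (c: ToolCall) => string): string {
  // Redact any argument whose key looks like a credential.
  const safeArgs = Object.fromEntries(
    Object.entries(call.args).map(([k, v]) =>
      /token|password|secret/i.test(k) ? [k, "[REDACTED]"] : [k, v]
    )
  );
  auditLog.push(
    `${new Date().toISOString()} ${call.tool} ${JSON.stringify(safeArgs)}`
  );
  return exec(call); // the unredacted call still executes normally
}

const out = runWithAudit(
  { tool: "http_get", args: { url: "http://staging.local", auth_token: "abc123" } },
  () => "200 OK" // stand-in for the real tool execution
);
console.log(out, auditLog[0].includes("[REDACTED]")); // 200 OK true
```

Keeping the redaction at the logging boundary means the agent can still authenticate in staging while the audit trail, which may end up in reports, never contains the token itself.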

The study is a snapshot. It documents clear strengths—agents can assist in PoC generation and file-level localisation—and clear limits—agents fail on complex, multi-component, authenticated reproductions. For security teams that worry about offensive automation, the takeaway is simple: lock down environments, tighten authentication and treat agent-produced exploits as starting points, not finished work.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?

Authors: Bin Liu, Yanjie Zhao, Guoai Xu, and Haoyu Wang
Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks, including code generation, vulnerability discovery, and automated testing. One critical but underexplored application is automated web vulnerability reproduction, which transforms vulnerability reports into working exploits. Although recent advances suggest promising potential, challenges remain in applying LLM agents to real-world web vulnerability reproduction scenarios. In this paper, we present the first comprehensive evaluation of state-of-the-art LLM agents for automated web vulnerability reproduction. We systematically assess 20 agents from software engineering, cybersecurity, and general domains across 16 dimensions, including technical capabilities, environment adaptability, and user experience factors, on 3 representative web vulnerabilities. Based on the results, we select three top-performing agents (OpenHands, SWE-agent, and CAI) for in-depth evaluation on our benchmark dataset of 80 real-world CVEs spanning 7 vulnerability types and 6 web technologies. Our results reveal that while LLM agents achieve reasonable success on simple library-based vulnerabilities, they consistently fail on complex service-based vulnerabilities requiring multi-component environments. Complex environment configurations and authentication barriers create a gap where agents can execute exploit code but fail to trigger actual vulnerabilities. We observe high sensitivity to input guidance, with performance degrading by over 33% under incomplete authentication information. Our findings highlight the significant gap between current LLM agent capabilities and the demands of reliable automated vulnerability reproduction, emphasizing the need for advances in environmental adaptation and autonomous problem-solving capabilities.

🔍 ShortSpan Analysis of the Paper

Problem

This study examines automated web vulnerability reproduction using large language model (LLM) agents. Reproducing vulnerabilities from incident reports into working proof-of-concept (PoC) exploits is crucial for threat validation and patching, yet real-world scenarios involve complex multi-component environments, dependency management, authentication barriers and deployment constraints that challenge automation. The work provides the first comprehensive empirical evaluation of state-of-the-art LLM agents for automated web vulnerability reproduction, aiming to understand what current agents can and cannot do in turning vulnerability reports into working exploits.

Approach

The researchers evaluate 20 representative LLM agents from the software engineering, cybersecurity and general domains across 16 core dimensions on 3 representative CVEs. They select three top-performing agents, OpenHands, SWE-agent and CAI, for in-depth analysis on a constructed benchmark of 80 real-world CVEs spanning 7 vulnerability types and 6 web technologies. A comprehensive dataset is built with complete reproduction environments, including Docker containers and PoC exploits with automated verification oracles. An end-to-end evaluation framework is established with a standard input-output interface and tasks spanning environment setup, vulnerability localisation, PoC generation and end-to-end reproduction. The experiments use three foundation LLMs, GPT-4.1, Claude Sonnet 4 and Gemini 2.5 Pro, and impose constraints such as a 60-minute timeout and oracle budgets of 2 per stage and 5 for end-to-end tasks. Hardware is a standard Linux environment with Docker and Python 3.11. Browser-based interactions are prohibited to ensure fair comparison.

Key Findings

  • Overall effectiveness in end-to-end vulnerability reproduction is limited, with success rates below 25 per cent; the best combination, OpenHands with Claude Sonnet 4, achieves 22.5 per cent end-to-end Success@3, with OpenHands plus GPT-4.1 reaching 20 per cent and OpenHands plus Gemini 2.5 Pro reaching 13.8 per cent; other configurations perform worse, and even the best results show substantial room for improvement.
  • Environment setup varies by agent; CAI shows the most consistent setup success across models, with rates up to 38.8 per cent, against 26.3 to 32.5 per cent for OpenHands and 20.0 to 25.0 per cent for SWE-agent.
  • Vulnerability localisation reveals a performance hierarchy; CAI with GPT-4.1 leads with 58.8 per cent at file level, 48.8 per cent at function level and 35.0 per cent at line level, while GPT-4.1 generally outperforms the other models and Gemini 2.5 Pro shows weaker results across agents.
  • PoC generation is a strength for OpenHands, especially with Claude Sonnet 4, where PoC execution reaches 70.0 per cent; however, there is a substantial gap from PoC execution to triggering the vulnerability, with trigger rates ranging from 8.8 to 21.3 per cent; OpenHands plus Claude Sonnet 4 achieves the highest trigger rate, at 21.3 per cent.
  • Model-specific patterns emerge; GPT-4.1 excels in localisation, while Claude Sonnet 4 demonstrates strong synergy with OpenHands for PoC generation and end-to-end orchestration; Gemini 2.5 Pro generally underperforms across most metrics.
  • Tool usage influences performance; OpenHands employs dedicated workflow-control tools that enhance memory and decision making across stages, while CAI relies on a broad tool suite of which many tools are under-utilised; output integrity and structured JSON outputs are strongest for OpenHands, indicating robust instruction following.
  • Vulnerability-type compatibility shows large variation; among service-based vulnerabilities in the tested scenarios, CSRF succeeds around 60 per cent of the time, Path Traversal 36 per cent, XSS 16 per cent, SSRF 15 per cent, RCE 8 per cent and SQL injection 0 per cent; Prototype Pollution, a library-based case, reaches 63 per cent, a strong contrast to the service-based results.
  • Web technology compatibility trends show PHP achieving higher end-to-end success (38.9 per cent) while TypeScript performs worst (11.1 per cent); language influences agent adaptation, with web-oriented languages showing better compatibility.
  • Efficiency measures reveal end-to-end costs of between $1.68 and $2.19 per successful reproduction, with times of 25.7 to 34.0 minutes; environment setup is the most resource-intensive stage, and CAI is most cost-efficient while OpenHands is most costly due to its thorough reasoning overhead.
  • Authentication context has a large impact; manually supplied tokens yield the highest success rate for OpenHands with Claude Sonnet 4, at 67 per cent; autonomous login reduces success by about a third, to between 25 and 50 per cent; and removing authentication help collapses results to around 8 per cent in some cases, indicating substantial reliance on explicit authentication guidance.
  • Overall, the findings indicate a substantial gap between current LLM agent capabilities and the practical demands of reliable automated vulnerability reproduction, with the leading systems performing well on simple library-based flaws but failing on complex multi-component service-based vulnerabilities where authentication and environment configuration are critical.

Limitations

The study notes internal threats, such as potential data leakage from training datasets, which could inflate performance if CVEs seen during training are exploited, and external validity concerns, including dataset representativeness and the containerised evaluation environment differing from production deployments. The authors acknowledge that the binary success measures may miss nuances of the reproduction process, and that rapid advances in LLMs mean the results reflect a snapshot rather than an enduring capability across models. They also recognise that the vulnerability selection and environment configurations may not capture the full distribution of real-world security scenarios.

Why It Matters

The work demonstrates the current limits and potential of LLM agents for automated vulnerability reproduction. It suggests practical defence focus areas such as accurate environment simulation, robust authentication handling and guardrails around automated exploit generation. The results highlight that improvements in environmental adaptation and autonomous problem solving are needed for reliable end-to-end reproduction, and that governance and safety controls are essential to prevent misuse as automation accelerates offensive and defensive cyber capabilities.

