LLM agents autonomously penetrate test servers


Published: Mon, Jun 15, 2026 • By Theo Solander


A new study tests whether Large Language Model (LLM) agents can independently break into real services. Using 300 Dockerised targets seeded with 30 remote code execution CVEs, success rates range from 10.7% to 69.3%. Results track with general model capability, and tool use lets agents exploit post‑cutoff bugs via Metasploit.

Pentesters have seen this film before. The first time Nessus scripts felt like cheating. The Metasploit “autopwn” phase where enumeration clicked into execution. And, further back, the Morris worm’s blunt lesson that automation does not need to be clever to be consequential. Today’s rerun swaps the shell scripts for Large Language Models (LLMs) wired to tools, and the pattern holds: once discovery and exploitation can be chained, initial access starts to scale.

This work sets out to measure that chain end to end. The authors built 300 Dockerised targets from 30 real remote code execution CVEs, each target pairing one vulnerable service with benign services that act as noise: Tier 1 adds one secure service, Tier 2 adds three. An LLM agent, given only a target IP and its own IP, gets a basic toolkit via the Model Context Protocol (MCP): Nmap for enumeration, WhatWeb for fingerprinting, and Metasploit for exploitation. No hints about versions or which door is open. Success means holding an interactive shell.
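To make the tier layout concrete, here is a minimal Python sketch of how one target might be described. The class and field names are hypothetical and the CVE identifier is a placeholder; only the service counts per tier come from the study as reported here.

```python
from dataclasses import dataclass

# Secure (non-vulnerable) services per tier, as described in the article.
SECURE_SERVICES_PER_TIER = {1: 1, 2: 3}


@dataclass(frozen=True)
class TargetSpec:
    """Hypothetical description of one target; names are illustrative only."""

    cve_id: str  # placeholder label, not a CVE taken from the study
    tier: int

    @property
    def secure_services(self) -> int:
        return SECURE_SERVICES_PER_TIER[self.tier]

    @property
    def total_services(self) -> int:
        # Exactly one vulnerable service plus the tier's secure services.
        return 1 + self.secure_services

    @property
    def random_pick_chance(self) -> float:
        # Chance that picking a service at random lands on the vulnerable one.
        return 1 / self.total_services


tier1 = TargetSpec(cve_id="CVE-PLACEHOLDER", tier=1)
tier2 = TargetSpec(cve_id="CVE-PLACEHOLDER", tier=2)
print(tier1.total_services, tier1.random_pick_chance)  # 2 0.5
print(tier2.total_services, tier2.random_pick_chance)  # 4 0.25
```

Seen this way, Tier 2 halves the odds of a blind guess, so an agent has to enumerate and fingerprint rather than fire at the first open port.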

Across 19 models, autonomous penetration lands between 10.7% and 69.3%. That range is not hand‑waving; success tracks with general model strength, with a strong correlation to overall capability (roughly r = 0.886 on Tier 1, 0.830 on Tier 2). Adding background services dents performance by only about 7.3% on average. In other words, the agents can sift signal from noise well enough to find the weak service and press it.
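For readers who want to see what a figure like r = 0.886 summarises, the sketch below computes a Pearson correlation by hand. The capability scores and success rates are synthetic placeholders chosen for illustration, not data from the paper, and the function name is our own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder values, NOT taken from the study.
capability_score = [20.0, 35.0, 50.0, 65.0, 80.0]
penetration_rate = [0.12, 0.25, 0.31, 0.52, 0.66]

r = pearson_r(capability_score, penetration_rate)
print(f"r = {r:.3f}")
```

A value near 1 means the two series rise together almost linearly; it does not by itself say which one drives the other.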

Two findings jump out for practitioners. First, tool integration bridges the LLM’s knowledge gap. Agents invoked Metasploit modules to exploit CVEs disclosed after the model’s training cut‑off. That feels very 2000s “exploit pack,” but with a reasoning loop driving the selection. Second, most failures are not for lack of target weakness but for orchestration stumbles: about 42% due to poor tool use by the model and about 46% due to the limits of the provided tools. The human analogue would be a junior tester miskeying args or missing a module, not misreading the network.

History’s rhyme is clear. When we made reconnaissance cheap, scanning became background noise. When we made exploitation modular, compromise spread faster than patching. Here, the agent glues the two together with just enough planning to keep going. The study stops at first shell, and the targets lack active defence, honeypots or enterprise topology. The open question is not whether agents can chain post‑exploitation next, but how quickly capability rises as models and toolchains improve. If the past is any guide, we tend to industrialise whatever works.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

Authors: Jiaqi Luo, Jiarun Dai, Zhile Chen, Jia Xu, Weibing Wang, Yawen Duan, Brian Tse, Geng Hong, Xudong Pan, Yuan Zhang, and Min Yang

Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies whether large language model (LLM)-powered AI systems can autonomously perform end-to-end penetration of remote servers without human intervention, and why that matters. Autonomous penetration is defined as an LLM using external tools to scan for, identify, and exploit a vulnerable service in order to obtain shell access. This capability is a core enabling step for higher-impact cyberattacks and is considered a critical safety red line for frontier AI systems.

Approach

The authors build a reproducible evaluation framework composed of controlled target servers and a lightweight, general-purpose agent scaffolding. Targets: 300 Docker-deployed servers built from 30 real CVE vulnerabilities that enable remote code execution, each combined with benign services to create two complexity tiers: Tier 1 (one vulnerable plus one secure service) and Tier 2 (one vulnerable plus three secure services). Agent scaffolding: a standard agent with a thinking module, a memory module using procedural summaries, and a tools module exposing general cybersecurity tools via the Model Context Protocol (MCP). Tools include Nmap, WhatWeb and Metasploit; the agent is given only the target IP and the local IP, not service names, versions or task-specific hints. The experiment runs 19 open-weight and proprietary LLMs across model families and scales, with each model–target pair run three times under time and step budgets. Success is defined as maintaining an interactive shell on the target.
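The three-module scaffolding can be pictured as a small loop. The Python sketch below is our own illustration, not the authors' code: the tools are stubs that return canned strings and never touch a network, the `think` function stands in for the LLM call with a trivial rule, exploitation is deliberately left out, and every name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Stub tools: canned strings only, no scanning or network access.
def stub_enumerate(target: str) -> str:
    return f"{target}: 2 open ports (canned result)"


def stub_fingerprint(target: str) -> str:
    return f"{target}: web service detected (canned result)"


# Tools module: a name-to-callable registry the agent can choose from.
TOOLS: Dict[str, Callable[[str], str]] = {
    "enumerate": stub_enumerate,
    "fingerprint": stub_fingerprint,
}


@dataclass
class Memory:
    """Memory module: keeps short procedural summaries of past steps."""

    summaries: List[str] = field(default_factory=list)

    def add(self, tool: str, result: str) -> None:
        self.summaries.append(f"{tool} -> {result}")


def think(memory: Memory) -> str:
    """Thinking module placeholder: pick the first unused tool, else stop."""
    used = {summary.split(" -> ")[0] for summary in memory.summaries}
    for name in TOOLS:
        if name not in used:
            return name
    return "stop"


def run_agent(target: str, max_steps: int = 5) -> Memory:
    """Loop think -> act -> remember until the step budget runs out."""
    memory = Memory()
    for _ in range(max_steps):
        action = think(memory)
        if action == "stop":
            break
        memory.add(action, TOOLS[action](target))
    return memory


# 192.0.2.10 is a reserved documentation address (TEST-NET-1).
log = run_agent("192.0.2.10")
for line in log.summaries:
    print(line)
```

The `max_steps` argument mirrors the step budget mentioned above; in the real framework the decision at each step comes from the model rather than a fixed rule.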

Key Findings

Non‑trivial end‑to‑end success: evaluated systems achieved penetration success rates from 10.7% to 69.3% across the 300 targets.
Correlation with model capability: penetration success strongly correlates with general LLM capability (Pearson r ≈ 0.886 for Tier 1 and r ≈ 0.830 for Tier 2), suggesting that misuse potential rises as frontier models improve.
Limited effect of background noise: moving from one secure service (Tier 1) to three (Tier 2) reduced average success by only about 7.3%, showing capable agents can enumerate a noisier host and still find the exploitable service.
Tool-enabled exploitation of unseen CVEs: agents exploited vulnerabilities disclosed after a model's knowledge cutoff by invoking Metasploit modules, demonstrating that external tools can bridge a model’s knowledge gap.
Primary failure causes: analysis of failures shows improper tool usage by the LLM accounts for about 42% of failures and insufficient capability of provided tools accounts for about 46%, while incorrect target selection and secondary issues make up the rest.
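The failure shares above imply how little is left for everything else. A quick check in Python, using only the approximate percentages quoted in this summary (variable names are ours):

```python
# Approximate shares of failed runs quoted in the summary, in percent.
improper_tool_use = 42    # the model misused an available tool
insufficient_tools = 46   # the provided tools could not do the job

tooling_share = improper_tool_use + insufficient_tools
remainder = 100 - tooling_share  # wrong target selection and secondary issues

print(f"tooling-related failures: about {tooling_share}%")  # about 88%
print(f"all other causes: about {remainder}%")              # about 12%
```

Roughly seven in eight failures trace back to tooling in one form or another, which leaves only about an eighth for picking the wrong service and other causes.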

Limitations

The evaluation focuses on initial shell acquisition in controlled Docker networks and does not measure post‑exploitation activities such as lateral movement or privilege escalation. Real‑world conditions such as honeypots, active defence, enterprise network layout and detection mechanisms were not modelled. Time and step budgets, toolset selection, and the use of Metasploit constrain generalisability. The released scaffolding is dual use and was shared under a responsible disclosure policy.

Implications

From an offensive security perspective, these results show that current LLM‑driven agents can autonomously discover and exploit remote services to obtain a foothold, and that increasing model capability and tool integration will likely expand that reach. Attackers could use such agents to automate initial access at scale, chain automated post‑exploitation for lateral movement, and leverage existing exploitation frameworks to act on vulnerabilities the model itself has not been trained on. The findings indicate an urgent need to consider governance and risk models for AI systems that can autonomously conduct penetration operations.

Links: Original paper on arXiv

