Difficulty-aware LLM agents lift pen test success
Pentesting
Automating penetration testing with Large Language Model (LLM) agents has looked promising, then underwhelming, depending on which demo you watched. A new study asks the basic question: why do these agents swing from single-digit success to respectable coverage, and what actually fixes it? The authors review 28 pentest-agent designs, evaluate five representative systems, and work across three benchmarks that range from capture-the-flag (CTF) tasks to a multi-host Active Directory lab. The result is a clearer picture of what fails, and what moves the needle.
They split failures into two camps. Type A issues are the obvious ones: missing tools, bad prompts, poor documentation access. You can usually engineer these away with better tool wrappers and retrieval. Type B issues survive all that. These are planning and state-management problems where the agent chases dead ends, commits too early, forgets context, and times out before finishing an attack chain. The root cause the authors see across models is simple: agents do not estimate task difficulty in real time, so they misallocate effort.
Difficulty-aware planning, not just bigger models
The proposed design pairs solid tooling with a difficulty-aware planner. A Tool and Skill Layer uses typed tool interfaces and retrieval-augmented knowledge to cover basic capability gaps. On top of that sits Task Difficulty Assessment (TDA), which estimates tractability along four signals: how long the path might be (horizon estimation), how convincing the current clues are (evidence confidence), how full the working memory is (context load), and how similar tasks have gone before (historical success). Those scores feed an Evidence-Guided Attack Tree Search (EGATS) that balances exploration and exploitation instead of blindly pushing down the first plausible branch. A memory subsystem externalises state to avoid losing track as prompts and outputs grow.
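To make the difficulty signals concrete, here is a minimal sketch of how a Task Difficulty Index could be computed from the four signals the paper names. The weights, value ranges and function names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class DifficultySignals:
    """The four signals named in the paper; the value ranges are assumptions."""
    horizon_estimate: float      # expected remaining path length, normalised to [0, 1]
    evidence_confidence: float   # how convincing the current clues are, [0, 1]
    context_load: float          # fraction of working memory / context in use, [0, 1]
    historical_success: float    # success rate on similar past tasks, [0, 1]

def task_difficulty_index(s: DifficultySignals,
                          weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Combine the signals into one tractability score in [0, 1].

    Higher means harder: a long horizon, weak evidence, a crowded context
    and a poor track record all push the index up. The linear weighting is
    an illustrative choice, not the paper's formula.
    """
    w_h, w_e, w_c, w_s = weights
    return (w_h * s.horizon_estimate
            + w_e * (1.0 - s.evidence_confidence)
            + w_c * s.context_load
            + w_s * (1.0 - s.historical_success))

# Example: a branch with weak evidence and a long expected path scores as hard,
# so the planner deprioritises or prunes it instead of committing early.
hard_branch = DifficultySignals(0.8, 0.2, 0.6, 0.3)
easy_branch = DifficultySignals(0.2, 0.9, 0.3, 0.8)
assert task_difficulty_index(hard_branch) > task_difficulty_index(easy_branch)
```

The point is not the particular formula but that the planner gets a live, per-branch estimate of effort before it spends that effort.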
Does it work?
Across benchmarks, the approach delivers consistent gains. On CTF-style tasks, the system reaches up to 91% task completion with frontier models, a relative improvement of 39% to 49% over baselines. On the PentestGPT benchmark, it roots 12 of 13 machines. In the GOAD environment, which simulates a five-host multi-domain Active Directory, it compromises four of five hosts, compared with at most two by prior systems. Ablation results attribute quick wins on short tasks to the Tool and Skill Layer, while multi-step progress comes from TDA, EGATS, and externalised memory working together.
There are limits. The agent still struggles with truly novel exploitation that requires creative invention rather than pattern matching. Adversarial or deceptive environments that feed it misleading state can throw it off. Long-running operations that humans carry across sessions remain a challenge. TDA can prune bad branches, but it cannot conjure exploits where no documentation or prior pattern exists. Some scenarios remain unsolved.
So what? For security teams kicking the tyres on agent-driven testing, the message is that architecture matters at least as much as model choice. Ask vendors how they estimate task difficulty, how tools are typed and validated, and how state is managed outside the prompt. The evidence suggests these pieces yield real end-to-end gains across models and benchmarks, which is more useful than another model-only upgrade. The commercial implication is straightforward: difficulty-aware planning can turn erratic agents into steadier labour on routine assessments and lab environments, improving coverage per pound spent. It will not replace skilled testers on creative or deceptive targets, but it can make your human team faster and less bored. That, in security operations, is often enough.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
What Makes a Good LLM Agent for Real-world Penetration Testing?
🔍 ShortSpan Analysis of the Paper
Problem
This paper examines why large language model-based agents show widely varying success when used for automated penetration testing and which failure modes persist as models improve. The work matters because automated testing promises to relieve a large shortfall in human security expertise, yet reported task completion rates vary from single digits to high fractions depending on system design and benchmark. The authors investigate whether failures stem from missing tooling and prompts or from deeper planning and state-management limitations that scale alone does not fix.
Approach
The authors survey 28 LLM-based penetration testing systems and evaluate five representative open-source implementations across three benchmarks of increasing realism: XBOW (104 CTF-style web tasks), the PentestGPT Benchmark (13 HTB and VulnHub machines), and GOAD (a 5-host multi-domain Active Directory environment). They characterise failures by analysing 200 execution traces, propose an architecture called PentestGPT v2, and test it with several frontier models. PentestGPT v2 combines a Tool and Skill Layer with typed tool interfaces and retrieval-augmented knowledge, a Task Difficulty Assessment mechanism that computes a Task Difficulty Index from four measurable signals, an Evidence-Guided Attack Tree Search planner that uses those estimates, and a Memory Subsystem that externalises state to avoid context forgetting. Ablations measure each component's contribution, and a live deployment in a capture-the-flag competition checks real-world viability.
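The paper summary above does not give pseudocode for the Evidence-Guided Attack Tree Search, so the following is a hedged sketch of one plausible reading: a UCB-style selection rule over attack-tree branches, where evidence confidence strengthens exploitation and the difficulty index discounts branches the TDA judges intractable. All names and the scoring formula are assumptions for illustration, not the authors' algorithm.

```python
import math
from dataclasses import dataclass, field

@dataclass
class AttackNode:
    """One branch of the attack tree, e.g. 'exploit exposed SMB service'."""
    action: str
    evidence_confidence: float   # [0, 1], from observations gathered so far
    difficulty_index: float      # [0, 1], supplied by the TDA component
    visits: int = 0
    reward: float = 0.0          # accumulated progress signal
    children: list = field(default_factory=list)

def select_branch(frontier: list[AttackNode], total_visits: int,
                  c_explore: float = 1.4) -> AttackNode:
    """Pick the next branch to expand.

    Exploitation favours branches with good past reward and strong evidence;
    the exploration term favours rarely visited branches; the difficulty
    index penalises branches judged intractable. The exact weighting is an
    illustrative assumption.
    """
    def score(n: AttackNode) -> float:
        exploit = (n.reward / n.visits if n.visits else 0.0) + n.evidence_confidence
        explore = c_explore * math.sqrt(math.log(total_visits + 1) / (n.visits + 1))
        return exploit + explore - n.difficulty_index
    return max(frontier, key=score)

# Usage: the agent re-scores open branches after each tool call, so a branch
# whose evidence weakens (or whose estimated difficulty rises) is demoted
# rather than pursued to exhaustion.
frontier = [
    AttackNode("brute-force login form", evidence_confidence=0.2,
               difficulty_index=0.8, visits=5, reward=0.5),
    AttackNode("exploit outdated CMS plugin", evidence_confidence=0.7,
               difficulty_index=0.3, visits=1, reward=0.4),
]
print(select_branch(frontier, total_visits=6).action)  # the well-evidenced, easier branch
```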
Key Findings
- Two distinct failure classes: Type A are capability gaps such as missing tools or documentation and are addressable by engineering; Type B are complexity barriers rooted in planning and state management and persist even with strong tooling.
- Type B failures share a common root cause: agents lack real-time task difficulty estimation, causing misallocation of effort, premature commitment to unproductive branches, exploration-exploitation imbalance, and context exhaustion.
- PentestGPT v2, which pairs strong tooling with difficulty-aware planning, improves end-to-end performance substantially: up to 91% task completion on the XBOW CTF set (49% relative improvement over baselines), roots 12 of 13 PentestGPT benchmark machines, and compromises 4 of 5 hosts on GOAD compared with at most 2 by prior systems; ablations show the Tool Layer helps short-horizon tasks while TDA-EGATS and Memory drive multi-step gains.
Limitations
Remaining constraints include failures on truly novel exploitation that requires creative invention rather than pattern matching, vulnerability to adversarial or deceptive environments that can mislead the agent's view of state, and limits in temporal scale and cross-session continuity where human testers maintain long-running strategic context. TDA can prune intractable branches but cannot by itself solve cases where no documentation or prior pattern exists, and some benchmark scenarios remain unsolved.
Why It Matters
The analysis clarifies that simply scaling LLMs does not eliminate core navigation and planning problems in automated penetration testing. Practical systems require both robust tool integration and difficulty-aware planners that monitor horizon, evidence, context load and historical success to guide exploration and pruning. The proposed design yields consistent gains across models and benchmarks, improving coverage and cost-effectiveness and offering a concrete blueprint for more reliable automated security assessments while highlighting where human expertise remains essential.