
Difficulty-aware LLM agents lift pen test success

Pentesting
Published: Fri, Feb 20, 2026 • By Adrian Calder
New research dissects why Large Language Model (LLM) agents often stall in automated penetration testing. It separates fixable tooling gaps from deeper planning failures, then shows difficulty-aware planning improves end-to-end results. Reported gains include up to 91% CTF task completion and better performance on an Active Directory lab than prior systems.

Automating penetration testing with Large Language Model (LLM) agents has looked promising, then underwhelming, depending on which demo you watched. A new study asks the basic question: why do these agents swing from single-digit success to respectable coverage, and what actually fixes it? The authors review 28 pentest-agent designs, evaluate five representative systems, and work across three benchmarks that range from capture-the-flag (CTF) tasks to a multi-host Active Directory lab. The result is a clearer picture of what fails, and what moves the needle.

They split failures into two camps. Type A issues are the obvious ones: missing tools, bad prompts, poor documentation access. You can usually engineer these away with better tool wrappers and retrieval. Type B issues survive all that. These are planning and state-management problems where the agent chases dead ends, commits too early, forgets context, and times out before finishing an attack chain. The root cause the authors see across models is simple: agents do not estimate task difficulty in real time, so they misallocate effort.

Difficulty-aware planning, not just bigger models

The proposed design pairs solid tooling with a difficulty-aware planner. A Tool and Skill Layer uses typed tool interfaces and retrieval-augmented knowledge to cover basic capability gaps. On top of that sits Task Difficulty Assessment (TDA), which estimates tractability along four signals: how long the path might be (horizon estimation), how convincing the current clues are (evidence confidence), how full the working memory is (context load), and how similar tasks have gone before (historical success). Those scores feed an Evidence-Guided Attack Tree Search (EGATS) that balances exploration and exploitation instead of blindly pushing down the first plausible branch. A memory subsystem externalises state to avoid losing track as prompts and outputs grow.
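The article describes TDA only at this level of detail, so as a rough illustration, here is a minimal Python sketch of how four such signals could be folded into a single tractability score that gates exploration against exploitation. The field names, weights and threshold are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class DifficultySignals:
    """Four tractability signals described in the paper; field names are illustrative."""
    horizon_estimate: float     # expected remaining steps, normalised to [0, 1]
    evidence_confidence: float  # how convincing the current clues are, [0, 1]
    context_load: float         # fraction of the working-memory budget consumed, [0, 1]
    historical_success: float   # success rate on similar past tasks, [0, 1]

def task_difficulty_index(s: DifficultySignals, weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Combine the signals into a single difficulty score (hypothetical weighting).

    Higher means harder: long horizons, weak evidence and a crowded context
    push the score up; a good historical track record pulls it down.
    """
    w_h, w_e, w_c, w_s = weights
    return (w_h * s.horizon_estimate
            + w_e * (1.0 - s.evidence_confidence)
            + w_c * s.context_load
            + w_s * (1.0 - s.historical_success))

def choose_mode(tdi: float, explore_threshold: float = 0.6) -> str:
    """Map the index to a coarse planning decision: hard-looking branches trigger
    broader exploration, tractable ones are exploited greedily."""
    return "explore_alternatives" if tdi > explore_threshold else "exploit_current_branch"

if __name__ == "__main__":
    signals = DifficultySignals(horizon_estimate=0.8, evidence_confidence=0.3,
                                context_load=0.5, historical_success=0.2)
    tdi = task_difficulty_index(signals)
    print(f"TDI={tdi:.2f} -> {choose_mode(tdi)}")  # TDI=0.71 -> explore_alternatives
```

The point of the sketch is the property the paper argues for: the planner gets a cheap, continuously updated estimate of how hard the current branch looks, rather than discovering intractability only after the context window is exhausted.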

Does it work?

Across benchmarks, the approach delivers consistent gains. On CTF-style tasks, the system reaches up to 91% task completion with frontier models, a relative improvement of 39% to 49% over baselines. On the PentestGPT benchmark, it roots 12 of 13 machines. In the GOAD environment, which simulates a five-host multi-domain Active Directory, it compromises four of five hosts, compared with at most two by prior systems. Ablation results attribute quick wins on short tasks to the Tool and Skill Layer, while multi-step progress comes from TDA, EGATS, and externalised memory working together.

There are limits. The agent still struggles with truly novel exploitation that requires creative invention rather than pattern matching. Adversarial or deceptive environments that feed it misleading state can throw it off. Long-running operations that humans carry across sessions remain a challenge. TDA can prune bad branches, but it cannot conjure exploits where no documentation or prior pattern exists. Some scenarios remain unsolved.

So what? For security teams kicking the tyres on agent-driven testing, the message is that architecture matters at least as much as model choice. Ask vendors how they estimate task difficulty, how tools are typed and validated, and how state is managed outside the prompt. The evidence suggests these pieces yield real end-to-end gains across models and benchmarks, which is more useful than another model-only upgrade. The commercial implication is straightforward: difficulty-aware planning can turn erratic agents into steadier labour on routine assessments and lab environments, improving coverage per pound spent. It will not replace skilled testers on creative or deceptive targets, but it can make your human team faster and less bored. That, in security operations, is often enough.
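To make the vendor questions concrete, the sketch below shows one hypothetical way a typed tool interface can validate an agent's arguments before anything reaches a shell. The schema, validation rules and wrapper names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
import shlex
import subprocess

@dataclass(frozen=True)
class NmapServiceScan:
    """A typed wrapper around one tool invocation; the schema is illustrative.

    Typing the inputs lets the agent framework validate arguments before the
    model's plan ever reaches a shell, catching malformed calls early.
    """
    target: str
    ports: str = "1-1024"

    def validate(self) -> None:
        if not self.target or any(c.isspace() for c in self.target):
            raise ValueError(f"invalid target: {self.target!r}")
        if not all(p.replace("-", "").isdigit() for p in self.ports.split(",")):
            raise ValueError(f"invalid port spec: {self.ports!r}")

    def command(self) -> list[str]:
        return ["nmap", "-sV", "-p", self.ports, self.target]

def run_tool(call: NmapServiceScan, timeout: int = 300) -> str:
    """Validate, execute, and return raw output for the agent's memory store."""
    call.validate()
    result = subprocess.run(call.command(), capture_output=True, text=True, timeout=timeout)
    return result.stdout

if __name__ == "__main__":
    scan = NmapServiceScan(target="10.0.0.5", ports="22,80,445")
    scan.validate()
    print(shlex.join(scan.command()))  # prints the command; does not run the scan
```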

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

What Makes a Good LLM Agent for Real-world Penetration Testing?

Authors: Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, and Tianwei Zhang
LLM-based agents show promise for automating penetration testing, yet reported performance varies widely across systems and benchmarks. We analyze 28 LLM-based penetration testing systems and evaluate five representative implementations across three benchmarks of increasing complexity. Our analysis reveals two distinct failure modes: Type A failures stem from capability gaps (missing tools, inadequate prompts) that engineering readily addresses, while Type B failures persist regardless of tooling due to planning and state management limitations. We show that Type B failures share a root cause that is largely invariant to the underlying LLM: agents lack real-time task difficulty estimation. As a result, agents misallocate effort, over-commit to low-value branches, and exhaust context before completing attack chains. Based on this insight, we present Excalibur, a penetration testing agent that couples strong tooling with difficulty-aware planning. A Tool and Skill Layer eliminates Type A failures through typed interfaces and retrieval-augmented knowledge. A Task Difficulty Assessment (TDA) mechanism addresses Type B failures by estimating tractability through four measurable dimensions (horizon estimation, evidence confidence, context load, and historical success) and uses these estimates to guide exploration-exploitation decisions within an Evidence-Guided Attack Tree Search (EGATS) framework. Excalibur achieves up to 91% task completion on CTF benchmarks with frontier models (39 to 49% relative improvement over baselines) and compromises 4 of 5 hosts on the GOAD Active Directory environment versus 2 by prior systems. These results show that difficulty-aware planning yields consistent end-to-end gains across models and addresses a limitation that model scaling alone does not eliminate.

🔍 ShortSpan Analysis of the Paper

Problem

This paper examines why large language model based agents show widely varying success when used for automated penetration testing and which failure modes persist as models improve. The work matters because automated testing promises to relieve a large shortfall in human security expertise, yet reported task completion rates vary from single digits to high fractions depending on system design and benchmark. The authors investigate whether failures stem from missing tooling and prompts or from deeper planning and state-management limitations that model scaling alone does not fix.

Approach

The authors survey 28 LLM-based penetration testing systems and evaluate five representative open-source implementations across three benchmarks of increasing realism: XBOW (104 CTF-style web tasks), the PentestGPT Benchmark (13 HTB and VulnHub machines), and GOAD (a 5-host multi-domain Active Directory environment). They characterise failures by analysing 200 execution traces, propose an agent called Excalibur, and test it with several frontier models. Excalibur combines a Tool and Skill Layer with typed tool interfaces and retrieval-augmented knowledge, a Task Difficulty Assessment mechanism that computes a Task Difficulty Index from four measurable signals, an Evidence-Guided Attack Tree Search planner that uses those estimates, and a Memory Subsystem that externalises state to avoid context forgetting. Ablations measure each component's contribution, and live deployment in a capture-the-flag competition checks real-world viability.
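The summary names EGATS but gives no algorithmic detail. As a sketch only, the following Python fragment shows one plausible way difficulty estimates could prune and prioritise branches in an attack-tree frontier; the scoring rule, pruning threshold and example actions are assumptions for illustration, not the paper's method.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional

@dataclass(order=True)
class AttackNode:
    """A candidate action in the attack tree; the scoring scheme is illustrative only."""
    priority: float
    action: str = field(compare=False)
    evidence: float = field(compare=False)   # confidence from observed output, [0, 1]
    difficulty: float = field(compare=False) # Task Difficulty Index for this branch, [0, 1]

def score(evidence: float, difficulty: float, prune_at: float = 0.9) -> Optional[float]:
    """Prefer well-evidenced, tractable branches; prune clearly intractable ones.

    Returns None for branches past the pruning threshold (a hypothetical rule).
    Lower value = higher priority, to suit a min-heap frontier.
    """
    if difficulty >= prune_at:
        return None
    return -(evidence * (1.0 - difficulty))

def select_next(frontier: list[AttackNode]) -> Optional[AttackNode]:
    """Pop the most promising surviving node from the frontier."""
    return heapq.heappop(frontier) if frontier else None

if __name__ == "__main__":
    candidates = [
        ("enumerate SMB shares", 0.8, 0.3),
        ("kerberoast service accounts", 0.6, 0.5),
        ("exploit unverified CVE on web host", 0.4, 0.95),  # pruned as intractable
    ]
    frontier: list[AttackNode] = []
    for action, ev, diff in candidates:
        s = score(ev, diff)
        if s is not None:
            heapq.heappush(frontier, AttackNode(priority=s, action=action,
                                                evidence=ev, difficulty=diff))
    nxt = select_next(frontier)
    print(f"Next action: {nxt.action} (priority {nxt.priority:.2f})")
```

Run as-is, the example selects the SMB enumeration branch first and drops the speculative exploit entirely, mirroring the balance between exploitation, exploration and pruning that the architecture is said to manage.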

Key Findings

  • Two distinct failure classes: Type A are capability gaps such as missing tools or documentation and are addressable by engineering; Type B are complexity barriers rooted in planning and state management and persist even with strong tooling.
  • Type B failures share a common root cause: agents lack real-time task difficulty estimation, causing misallocation of effort, premature commitment to unproductive branches, exploration-exploitation imbalance, and context exhaustion.
  • Excalibur, which pairs strong tooling with difficulty-aware planning, improves end-to-end performance substantially: up to 91% task completion on the XBOW CTF set (49% relative improvement over baselines), roots 12 of 13 PentestGPT benchmark machines, and compromises 4 of 5 hosts on GOAD compared with at most 2 by prior systems; ablations show the Tool Layer helps short-horizon tasks while TDA-EGATS and Memory drive multi-step gains.

Limitations

Remaining constraints include failures on truly novel exploitation that require creative invention rather than pattern matching, vulnerability to adversarial or deceptive environments that can mislead state, and limits in temporal scale and cross-session continuity where human testers maintain long-running strategic context. TDA can prune intractable branches but cannot by itself solve cases where no documentation or prior pattern exists, and some benchmark scenarios remain unsolved.

Why It Matters

The analysis clarifies that simply scaling LLMs does not eliminate core navigation and planning problems in automated penetration testing. Practical systems require both robust tool integration and difficulty-aware planners that monitor horizon, evidence, context load and historical success to guide exploration and pruning. The proposed design yields consistent gains across models and benchmarks, improving coverage and cost-effectiveness and offering a concrete blueprint for more reliable automated security assessments while highlighting where human expertise remains essential.

