Planner-led Agents Boost Automated Penetration Testing
Pentesting
Automated penetration testing is getting a serious upgrade, and the paper under discussion shows why mixing old-school planning with modern language models is more than a novelty. The researchers frame the problem with a Planner-Executor-Perceptor, or PEP, design to split reasoning, action and observation into separate parts. Large Language Model (LLM) agents remain useful, but the PEP split exposes where they trip up: long-horizon plans, complex reasoning and reliable use of specialised tools.
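To make the PEP split concrete, here is a minimal Python sketch of how a planner, executor and perceptor might hand off to one another. The class and method names are illustrative assumptions, not the paper's code.

```python
# Illustrative Planner-Executor-Perceptor (PEP) loop. All names and interfaces
# here are hypothetical; the paper defines the paradigm, not this code.
from dataclasses import dataclass, field

@dataclass
class State:
    facts: set = field(default_factory=set)  # e.g. {"port_open(8161)", "service(http)"}

class Planner:
    def next_action(self, state: State):
        """Choose the next action whose preconditions hold in the current state."""
        raise NotImplementedError

class Executor:
    def run(self, action) -> str:
        """Carry out the action (e.g. an LLM-guided tool call) and return raw output."""
        raise NotImplementedError

class Perceptor:
    def observe(self, raw_output: str, state: State) -> State:
        """Translate raw tool output back into planner-friendly facts."""
        raise NotImplementedError

def pep_loop(planner: Planner, executor: Executor, perceptor: Perceptor,
             state: State, max_steps: int = 20) -> State:
    """Alternate planning, execution and perception until no action applies."""
    for _ in range(max_steps):
        action = planner.next_action(state)
        if action is None:  # goal reached or nothing applicable
            break
        raw_output = executor.run(action)
        state = perceptor.observe(raw_output, state)
    return state
```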
What they did
The authors build a system called CHECKMATE that pairs a classical planner with an LLM-driven executor and a perceptor that translates tool output back into planner-friendly state. Classical planning+ extends traditional planning to cope with non-deterministic action effects in partially observable environments, while predefined attack actions give the LLM concrete, vetted commands for tools such as Metasploit modules, Nmap Scripting Engine (NSE) scripts and Nuclei templates. The paper evaluates CHECKMATE on the Vulhub benchmark: 120 containerised targets and 11 milestones used to measure penetration progress. They also compare against four baselines and three out-of-the-box agent stacks, noting that Claude Code teamed with Sonnet 4.5 is the strongest standalone agent observed.
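To picture what a predefined attack action might look like, here is a hypothetical library entry in Python: a vetted scan command with explicit preconditions and effects the planner can reason over. The field names, predicates and command template are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical predefined-action entry: a vetted Nmap service scan exposed to
# the planner with explicit preconditions and effects. All field and predicate
# names are illustrative.
SERVICE_VERSION_SCAN = {
    "name": "service_version_scan",
    "tool": "nmap",
    "command": "nmap -sV --script=banner -p {port} {target}",
    "preconditions": ["host_alive({target})", "port_open({target},{port})"],
    "effects": ["service_identified({target},{port})"],
}

def render_command(action: dict, **bindings) -> str:
    """Bind planner variables (target, port, ...) into the vetted command template."""
    return action["command"].format(**bindings)

# Example: render_command(SERVICE_VERSION_SCAN, target="10.0.0.5", port=8161)
# -> "nmap -sV --script=banner -p 8161 10.0.0.5"
```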
The headline results are concrete. CHECKMATE raises benchmark success rates by over 20 percent compared with the best agent baseline, and it cuts both time and monetary cost by more than half. The paper reports an average execution cost of $0.68 and an average run time of 7.75 minutes across twenty tasks. An ablation study shows that swapping out core components causes substantial drops in performance, which underlines that the architecture and the chosen agent pairings both matter.
Why it matters and what to watch
This is a useful step towards practical, automated testing workflows. The planner supplies a stable, concise brain that avoids the verbosity and drift you often get when an LLM tries to invent a multi-step strategy on the fly. In practice that means fewer redundant actions, faster runs and cheaper cloud bills. The trade-offs are clear: CHECKMATE relies on predefined actions and a planner model, which improves reliability but may limit generalisation to novel tools or highly dynamic environments. The Vulhub testbed is also limited to single-application scenarios, so results do not fully cover lateral movement, complex privilege escalation, or credential harvesting in sprawling networks.
There is also a broader risk equation. Faster automated discovery aids defenders by finding and fixing bugs sooner, but the same capabilities, left unchecked, could enable scalable offensive automation. The authors flag governance, access controls and monitoring as essential; that is not optional technobabble but a practical safeguard.
If you manage pen testing programmes, start by treating CHECKMATE as an instructive pattern rather than a turnkey threat or defence. Adopt structured action libraries for any automation you allow, log and audit planner decisions, and test on representative multi-host environments before trusting results at scale. Researchers can take this further by addressing non-text data, visual interfaces and multimodal planning, but for now the main lesson is simple: marry an explicit planner to an LLM executor and you get something that behaves more like a disciplined operator and less like a chatty amateur.
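As a concrete starting point for the logging advice above, the snippet below sketches one way to keep an append-only, line-per-decision audit trail. The record fields and helper function are assumptions for illustration, not part of CHECKMATE.

```python
# Minimal, hypothetical audit trail for planner decisions (JSON Lines format).
# Field names are illustrative; adapt to your own tooling and retention policy.
import json
import time

def log_decision(logfile: str, action: str, command: str, state_summary: str) -> None:
    """Append one planner decision so every authorised command can be audited later."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "action": action,          # planner-level action name
        "command": command,        # exact command handed to the executor
        "state": state_summary,    # facts the planner acted on
    }
    with open(logfile, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

# Example: log_decision("planner_audit.jsonl", "service_version_scan",
#                       "nmap -sV -p 8161 10.0.0.5", "port_open(10.0.0.5,8161)")
```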
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Automated Penetration Testing with LLM Agents and Classical Planning
🔍 ShortSpan Analysis of the Paper
Problem
Penetration testing is essential for proactively identifying vulnerabilities, yet achieving fully automated hands-off pentesting remains a key challenge. The paper presents the Planner-Executor-Perceptor (PEP) design paradigm to systematise this area and uses it to review existing work and identify core challenges. It also evaluates current pentesting systems with a focus on large language model (LLM) agents. The authors discuss the strengths and limits of automated attackers, emphasise the potential and risks of autonomous LLM driven attacks, and note the societal and security implications of speeding vulnerability discovery alongside safeguards to prevent scalable exploitation.
Approach
The authors propose CHECKMATE, a framework that couples enhanced classical planning with LLM agents to provide an external planning brain that mitigates weaknesses in LLM-based systems. They describe a unified Planner-Executor-Perceptor (PEP) paradigm and decompose pentesting into three components: a planner, an executor, and a perceptor. Classical planning+ is introduced to handle dynamic updates and non-deterministic effects in partially observable environments, extending traditional planning with LLM-assisted state and action-effect updates. Predefined attack actions expand the LLM's knowledge of specialised tools such as Metasploit modules, NSE scripts, and Nuclei templates, improving command reliability and reducing hallucinations. The planner uses explicit preconditions and effects to model causal relationships, while the executor carries out actions via an LLM-driven, structured prompt guided by the planner. The perceptor translates heterogeneous outputs into planning predicates, enabling iterative updates. The system is evaluated on the Vulhub dataset of 120 containers with 11 milestones to measure progress, and is contrasted against four baselines and three out-of-the-box LLM agents (Claude Code plus Sonnet 4.5, Codex plus o4 mini, and Gemini Code Assist plus Gemini Pro 2.5). The evaluation uses minimal human intervention and a stall threshold of two hours per step. They also compare against RAG-based approaches and JSON planning structures to assess the value of their design choices.
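As a rough illustration of the perceptor's role described above, the sketch below turns a fragment of Nmap-style output into planning predicates. The predicate vocabulary, regular expression and port numbers are invented for this example and do not reproduce the paper's implementation.

```python
import re

def perceive_nmap(output: str, target: str) -> set:
    """Hypothetical perceptor step: extract open-port facts from Nmap-style text."""
    facts = set()
    for match in re.finditer(r"^(\d+)/tcp\s+open\s+(\S+)", output, re.MULTILINE):
        port, service = match.groups()
        facts.add(f"port_open({target},{port})")
        facts.add(f"service({target},{port},{service})")
    return facts

sample_output = "8161/tcp  open  http\n61616/tcp open  activemq"
# perceive_nmap(sample_output, "10.0.0.5") ->
# {"port_open(10.0.0.5,8161)", "service(10.0.0.5,8161,http)",
#  "port_open(10.0.0.5,61616)", "service(10.0.0.5,61616,activemq)"}
```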
Key Findings
- Claude Code combined with Sonnet 4.5 shows the strongest out-of-the-box pentesting capability among evaluated baselines, substantially outperforming prior systems in most milestones.
- LLM agents exhibit key limitations, including difficulty maintaining coherent long-horizon plans, performing complex reasoning, and effectively using specialised tools, which constrains the efficiency and stability of automated pentesting.
- CHECKMATE outperforms the state-of-the-art Claude Code in penetration capability on the Vulhub benchmark, raising benchmark success rates by over 20 percent and delivering considerably greater stability.
- CHECKMATE reduces both time and monetary costs by more than 50 percent relative to Claude Code, with an average total cost of $0.68 over twenty tasks and an average execution time of 7.75 minutes, driven by the symbolic planning framework that avoids verbose plan generation by the LLM.
- An ablation study shows that removing components (for example replacing Claude Code with another agent, or replacing Sonnet 4.5 with o4 mini) causes substantial performance drops, confirming the importance of both the Claude Code agent and the Sonnet 4.5 model.
- Compared with RAG-based knowledge expansion and a structured JSON planning-file approach, classical planning+ yields the largest gains in efficiency and consistency, while still enabling end-to-end automation.
- A detailed case study on an Apache ActiveMQ target shows CHECKMATE solving the task in three steps, whereas Claude Code required many more steps, including redundant or misdirected actions, illustrating the benefits of explicit planning and predefined actions for reliability and speed.
- The evaluation design uses 120 Vulhub containers, 11 milestones, and 20 tasks for efficiency and cost assessment, highlighting that while CHECKMATE advances capability, some milestones such as lateral movement, privilege escalation and credential leakage may not be reached in Vulhub's single-application setting.
Limitations
The study relies on the Vulhub dataset, which simulates single-application vulnerabilities, so higher milestones such as lateral movement and credential access may not be fully representative of real-world breadth. While CHECKMATE improves planning reliability and efficiency, the framework still depends on predefined actions and planner-driven updates, which may limit generalisation to scenarios with unseen tools or highly dynamic environments. The authors acknowledge open questions including the representation of actions and states for complex pentesting, handling non-text data and visual interfaces, and integrating future multimodal capabilities. They also note broader concerns about the dual use of automated pentesting technologies and the need for safeguards, governance, and responsible deployment in practice.
Why It Matters
The work demonstrates that combining LLM-driven attack agents with a classical planner can automate substantial portions of penetration testing, improving reliability, long-horizon planning, and tool use while reducing time and cost. The external planning brain provided by CHECKMATE offers measurable gains in success rates and stability, suggesting practical benefits for rapid vulnerability discovery and system hardening. However, the results also underscore the risk that autonomous attackers could be misused if safeguards are not in place, emphasising the importance of governance, access controls, and monitoring to prevent harmful exploitation. The framework also highlights ongoing research directions in the representation of actions and states, the integration of visual and multimodal data, and safer, auditable autonomous security testing workflows.