AutoPentester Automates Red-Team Tasks, Reveals Gaps

Pentesting

Published: Wed, Oct 08, 2025 • By Natalie Kestrel

AutoPentester Automates Red-Team Tasks, Reveals Gaps

AutoPentester uses a Large Language Model (LLM) agent to automate end-to-end penetration testing and yields measurable gains versus PentestGPT. The framework raises subtask completion and vulnerability coverage while cutting human interactions, but introduces automation overhead and new risks such as prompt injection and hallucination that teams must mitigate before deployment.

AutoPentester is a new framework that probes a familiar promise: let a Large Language Model (LLM) drive the repetitive parts of penetration testing. The system chains five LLM modules to plan strategy, generate tool commands, run command-line scanners and exploitation tools, verify results and detect loops. The authors test three LLM backbones and settle on GPT-4 Turbo for the reported experiments.

What the evaluation shows

On Hack The Box scenarios AutoPentester achieves a 59.92 percent average subtask completion rate compared with 47.18 percent for PentestGPT, a relative improvement the paper reports as 27.0 percent. It also claims 39.5 percent greater vulnerability coverage with 18.7 percent fewer steps. In lab tests on custom vulnerable virtual machines AutoPentester finds 98.14 percent of known issues versus 70.37 percent for PentestGPT.

Automation reduces human touch. The framework averaged about 1.13 human interactions per machine versus 15.36 for PentestGPT. The authors also record far fewer malformed or incomplete commands, 0.1 per machine compared with 4.46, and far fewer repeated loops during a run: 0.3 loops per machine compared with 2.1. An ablation study attributes most gains to the Strategy Analyzer and the retrieval-augmented generation (RAG) knowledge base; the Repetition Identifier and Results Verifier further cut loops and incomplete commands.

Why practitioners should care — and be sceptical

Those are promising numbers if you need scalable, repeatable coverage outside normal working hours. AutoPentester orchestrates common tools such as Nmap, Metasploit, Dirbuster, Nikto, Sqlmap, John the Ripper, Smbclient, Dnsrecon and SSLscan, and it builds reports automatically. But the paper is frank about limits. The approach depends on command-line tool compatibility, a curated knowledge base and the LLM correctly interpreting tool outputs. Web application corners and GUI-only workflows remain weak spots. The system also takes longer wall-clock time to run than PentestGPT because of automation overhead and tool interaction delays.

Security risks are not hypothetical. Chaining an LLM to tooling creates new attack surfaces: prompt injection, hallucinated findings, and unexpected command generation could leak data or trigger unsafe actions. The authors recommend sandboxing, strict input/output validation, layered verification of findings, robust logging and adversarial testing, and the paper acknowledges the ethical risk that the same automation could be repurposed by adversaries.

The results are useful but not definitive. The user survey sample is modest and the experiments use curated vulnerable targets where coverage is measurable. That matters: production networks are messier than training playgrounds, and a tool that automates a checklist can still miss context-dependent, high-risk pathways that a skilled tester would spot.

If you are tempted to hand your scanners over to an agent, apply basic operational hygiene first. Run the agent in a strict sandbox, keep tool privileges minimal, and treat every automated finding as provisional until validated.

Run the agent in an isolated environment and audit all executed commands in real time.
Validate a sample of findings manually and cross-check outputs against raw tool logs before remediation.
Stress-test the agent with prompt-injection scenarios and adversarial inputs as part of your red-team routine.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

AutoPentester: An LLM Agent-based Framework for Automated Pentesting

Authors: Yasod Ginige, Akila Niroshan, Sajal Jain, and Suranga Seneviratne

Penetration testing and vulnerability assessment are essential industry practices for safeguarding computer systems. As cyber threats grow in scale and complexity, the demand for pentesting has surged, surpassing the capacity of human professionals to meet it effectively. With advances in AI, particularly Large Language Models (LLMs), there have been attempts to automate the pentesting process. However, existing tools such as PentestGPT are still semi-manual, requiring significant professional human interaction to conduct pentests. To this end, we propose a novel LLM agent-based framework, AutoPentester, which automates the pentesting process. Given a target IP, AutoPentester automatically conducts pentesting steps using common security tools in an iterative process. It can dynamically generate attack strategies based on the tool outputs from the previous iteration, mimicking the human pentester approach. We evaluate AutoPentester using Hack The Box and custom-made VMs, comparing the results with the state-of-the-art PentestGPT. Results show that AutoPentester achieves a 27.0% better subtask completion rate and 39.5% more vulnerability coverage with fewer steps. Most importantly, it requires significantly fewer human interactions and interventions compared to PentestGPT. Furthermore, we recruit a group of security industry professional volunteers for a user survey and perform a qualitative analysis to evaluate AutoPentester against industry practices and compare it with PentestGPT. On average, AutoPentester received a score of 3.93 out of 5 based on user reviews, which was 19.8% higher than PentestGPT.

🔍 ShortSpan Analysis of the Paper

Problem

Penetration testing and vulnerability assessment are essential practices for protecting computer systems, but escalating cyber threats and regulatory demands have increased the need for scalable testing beyond what available human expertise can reliably provide. Prior automated approaches based on AI or large language models (LLMs) are largely semi automated and require substantial human interaction to execute tests, interpret outputs, and generate reports. This paper introduces AutoPentester, an LLM agent based framework that aims to automate the end to end pentesting process for a given target IP, iterating through reconnaissance, scanning, vulnerability assessment and exploitation while adapting strategies based on tool outputs and prior findings.

Approach

AutoPentester is built around five LLM based modules that emulate a human pentester in an iterative workflow. The Strategy Analyzer reasons over prior results to derive an updated attack strategy using chain of thought reasoning and a findings oriented pentest tree. The RAG based Generator creates advance commands for multiple tools by retrieving relevant knowledge from a curated knowledge base and current context, enabling automated execution without human input. The Agent Computer Interface executes commands on CLI based tools such as Nmap, Metasploit, Dirbuster and others via a combination of subprocess and pexpect. The Results Verifier checks tool outputs and refines commands to improve results, and the Repetition Identifier detects looping where similar previous steps are repeated, offering options to continue, exit or switch path. A Summarizer converts tool outputs into human readable form, chunking long outputs to overcome token limits and feeding summaries to the Strategy Analyzer. AutoPentester maintains a log and ultimately generates a structured csv report of findings. The framework uses a RAG knowledge base drawn from Metasploit manuals, hands on penetration testing resources and HackTrics style articles, stored as vector embeddings, with query retrieval via cosine similarity to support command generation. Tools used include Nmap, Nikto, Metasploit, Dirbuster, John the Ripper, Sqlmap, Smbclient, Dnsrecon, and SSLscan. The system was evaluated on Hack The Box machines and custom vulnerable VMs, and three LLM backbones were tested, with GPT 4 Turbo selected for the remainder of experiments.

Key Findings

AutoPentester demonstrates superior performance over the state of the art PentestGPT across HTB machines: average subtask completion of 59.92 percent versus 47.18 percent for PentestGPT, a 27.0 percent improvement, and 39.5 percent greater vulnerability coverage with 18.7 percent fewer steps.
The framework achieves substantially fewer human interactions, averaging about 1.13 interactions per machine compared with PentestGPT's 15.36, and shows a large reduction in incomplete commands, averaging 0.1 per machine versus 4.46 for PentestGPT (97.7 percent reduction).
Loop minimisation is achieved through the Repetition Identifier, reducing loops per step by 85.7 percent (0.3 loops per machine versus 2.1 for PentestGPT).
In vulnerability and threat assessment on custom VMs, AutoPentester consistently identifies more vulnerabilities, achieving an average coverage of 98.14 percent versus 70.37 percent for PentestGPT.
An ablation study shows the Strategy Analyzer improves subtask completion by about 25 percent over a baseline, while adding the RAG module further improves completion and the Repetition Identifier and Results Verifier substantially reduce loops and incomplete commands; the combination of RAG and Repetition Identifier yields the highest performance.
A user survey with cybersecurity professionals reports an average score of 3.93 out of 5 for AutoPentester, around 19.8 percent higher than PentestGPT, with respondents noting time saving and suitability for enterprise assessments and red team drills.
Time and cost analyses indicate AutoPentester takes longer to complete tasks on average than PentestGPT due to automation overhead and tool interaction delays, but incurs only a marginally higher cost and enables testing outside standard working hours, reducing the need for manual labour.

Limitations

Fully automated operation relies on command line based interactions with tools, which can struggle with GUI driven interfaces; an interactive mode is provided to allow user guidance. The approach depends on a current and comprehensive knowledge base; gaps can lead to missed corner cases. AutoPentester may still fail to identify optimal strategies on some HTB machines, particularly for web applications requiring web focused information or external web browsing. The user study utilised a modest sample size, and results may vary with different models or tool configurations. Security risks include potential prompt injection and model hallucinations in LLM orchestrations if safety boundaries are not properly enforced, underscoring the need for sandboxing, input/output validation, layered verification, auditing and red team testing.

Why It Matters

The work demonstrates an end to end AI driven approach to automated security testing with measurable improvements in coverage and reductions in human input, supporting scalable, repeatable security validation and red team style assessments. It highlights practical security implications of relying on LLM based orchestration, including risks of prompt injection and hallucinations, and the importance of mitigations such as sandboxing, strict input validation, layered verification, robust auditing and red teaming. Societal and security impacts include accelerated vulnerability discovery and reduced human effort, but also potential misuse by adversaries to automate attacks; governance, ethics and consent are therefore important in deploying automated testing tools in production. The authors suggest future directions including fine tuning LLMs to learn pentesting strategies, applying reinforcement learning and human feedback, and adding GUI based tools and web focused security tools to broaden automation while maintaining safety.

Attribution Original paper on arXiv