New Benchmark Shows AI Pentesters Fail Real Targets
Pentesting
Penetration testing finds vulnerabilities before bad actors do. But until now, many AI-driven pentesting systems have been measured in artificial capture-the-flag settings that hand the agent helpful hints. TermiBench changes that. It spans 510 hosts covering 25 services and 30 real CVEs, and it asks agents to obtain a genuine system shell, not just grab a planted flag. That brings the test close to what defenders actually face.
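To make the setup concrete, here is a minimal sketch of what a TermiBench-style target might look like. The benchmark's actual format is not published in this summary, so every field name and value below is an illustrative assumption:

```python
# Hypothetical sketch of a TermiBench-style target definition.
# All field names and values are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class TargetHost:
    subnet: str                    # the only knowledge the agent starts with
    vulnerable_service: str        # service carrying a real CVE
    cve: str
    benign_services: list[str] = field(default_factory=list)  # production-like noise

def is_compromised(shell) -> bool:
    """Success means a working system shell, not a planted flag."""
    return "uid=" in shell.run("id")   # live command execution proves the shell

target = TargetHost(
    subnet="10.0.7.0/24",
    vulnerable_service="apache-activemq:5.17.3",
    cve="CVE-2023-46604",
    benign_services=["nginx:1.24", "openssh:9.2", "postgres:15"],
)
```

The point is the success criterion: command execution on a live shell, not retrieval of a flag the benchmark planted in advance.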
The headline finding is blunt: off-the-shelf agents largely fail in realistic conditions. TermiAgent improves matters by focusing on two practical problems. First, memory and context matter: Located Memory Activation helps the agent select relevant past observations so it does not forget what it learned minutes earlier. Second, exploit packaging matters: a structured Arsenal Module and Unified Exploit Descriptor let the agent apply tools reliably instead of copy-pasting fragile proof-of-concept code. The result is higher success rates, shorter runs and lower cost, even on laptop-scale hardware.
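The summary does not spell out how Located Memory Activation works internally. A minimal sketch of the general idea, relevance-gated retrieval of past observations, is below; the embedding similarity, threshold and every name are assumptions, not TermiAgent's actual implementation:

```python
# Sketch of relevance-gated memory selection in the spirit of Located
# Memory Activation. Scoring function and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Memory:
    step: int
    text: str                # a past observation, e.g. tool output
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def activate(memories: list[Memory], query_emb: list[float],
             k: int = 5, threshold: float = 0.3) -> list[Memory]:
    """Return only memories relevant to the current decision, so stale
    context does not crowd the LLM's window."""
    scored = sorted(((cosine(m.embedding, query_emb), m) for m in memories),
                    key=lambda t: t[0], reverse=True)
    return [m for s, m in scored if s >= threshold][:k]
```

Capping the activated set with a relevance floor keeps the context window small, which is one plausible route to the time and cost savings reported.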
This matters because better tooling helps defenders close gaps, but it also lowers the bar for misuse. Trade-offs are clear: more automation reduces cost and human effort but raises dual-use risk and may encourage checkbox compliance rather than real hardening. Practical steps this quarter: run realistic benchmark tests on isolated networks, add benign noise to mimic production, evaluate small LLMs for controlled testing, and require exploit descriptors for any automated tests. Later, invest in governance: procurement standards for AI security tools, continuous monitoring, and tighter access controls for automated pentesting components. Performance metrics are useful, but realism and governance are what make those metrics meaningful.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing
🔍 ShortSpan Analysis of the Paper
Problem
Penetration testing is essential for identifying and mitigating security vulnerabilities, yet traditional approaches remain expensive, time-consuming and reliant on expert human labour. AI-driven pentesting agents have been evaluated mainly in oversimplified capture-the-flag environments that embed prior knowledge and do not reflect real-world practice. This work introduces TermiBench, the first real-world, agent-oriented pentesting benchmark, and TermiAgent, a multi-agent framework for achieving full system control in realistic environments. TermiBench spans 510 hosts across 25 services and 30 CVEs; environments require autonomous reconnaissance, discrimination between benign and exploitable services, and robust exploit execution. Under TermiBench, existing systems struggle to obtain system shells in realistic conditions.
Approach
TermiBench is founded on real-world fidelity, blind evaluation and systematic service integration. Targets comprise 510 hosts with 30 CVEs across 25 services; benign background services are added to create noise; evaluation covers configurations ranging from a basic 1+0 setup to complex 7+1 setups. The objective is to obtain a remote shell or root access, with no prior knowledge beyond the target subnet. The TermiAgent framework manages real-world, end-to-end pentesting with five modules: the Reasoner Module, the Assistant Module, the Executor Module, the Memory Module and the Arsenal Module. A Located Memory Activation mechanism selects memories relevant to the current decision, and memory is compressed at varying levels to match task requirements. Exploits are packaged into a Unified Exploit Descriptor that captures environmental and operational dimensions, enabling Dockerized environments and concise manuals. The Arsenal Module integrates in-the-wild exploits and open-source tools into TermiAgent as plug-and-play modules. The system runs on Kali hosts via LangGraph and supports several back-end LLMs, including lightweight models, enabling laptop-scale deployment.
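The Unified Exploit Descriptor is described only at the level of its two dimensions, environmental and operational. A minimal sketch consistent with that description, with all field names, the Docker invocation and the example values assumed, might look like this:

```python
# Illustrative sketch of a Unified Exploit Descriptor along the two
# dimensions the paper names (environmental and operational). Field
# names and the Docker workflow are assumptions, not the paper's format.
from dataclasses import dataclass, field

@dataclass
class EnvironmentalSpec:
    base_image: str                 # Dockerized runtime for the exploit
    dependencies: list[str] = field(default_factory=list)

@dataclass
class OperationalSpec:
    entrypoint: str                 # how to invoke the exploit
    arguments: dict[str, str] = field(default_factory=dict)
    success_signal: str = "shell"   # expected outcome, e.g. a remote shell

@dataclass
class UnifiedExploitDescriptor:
    cve: str
    service: str
    env: EnvironmentalSpec
    ops: OperationalSpec

    def manual(self) -> str:
        """Render a concise manual the agent can follow instead of
        copy-pasting raw, fragile proof-of-concept code."""
        args = " ".join(f"--{k} {v}" for k, v in self.ops.arguments.items())
        return f"[{self.cve}] docker run {self.env.base_image} {self.ops.entrypoint} {args}"

ued = UnifiedExploitDescriptor(
    cve="CVE-2023-46604",
    service="apache-activemq",
    env=EnvironmentalSpec(base_image="python:3.11-slim", dependencies=["requests"]),
    ops=OperationalSpec(entrypoint="exploit.py", arguments={"target": "<host>", "port": "61616"}),
)
print(ued.manual())
```

Packaging the runtime alongside the invocation is what would let an exploit behave as a plug-and-play Arsenal module rather than a fragile PoC script.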
Key Findings
- TermiAgent solves about 1.7 times as many capture-the-flag challenges and over 8 times as many real-world pentesting tasks as state-of-the-art agents.
- In real-world settings, TermiAgent succeeds on more than half of target hosts with every tested back-end LLM, while competing agents struggle: VulnBot fails to cope and PentestGPT fails to compromise any target.
- TermiAgent cuts execution time to less than one fifth and financial cost to around one tenth in real-world tasks, aided by Located Memory Activation, which filters irrelevant context, and by the Penetration Memory Tree, which prioritises exploitation paths (sketched after this list).
- The Arsenal Module and the Unified Exploit Descriptor prove critical: ablating the Arsenal reduces success by about 29.66 per cent, and removing Located Memory Activation reduces success by about 66.95 per cent, underscoring the importance of memory handling and robust exploit packaging.
- TermiBench comprises 510 targets with 30 CVEs across 25 services, with benign background services increasing environmental complexity; results show that even small back-end LLMs can drive effective end-to-end pentesting, with a 4B model achieving substantial penetration rates.
- Compared with Metasploit, TermiAgent's arsenal covers around 1.8 times more RCE CVEs, and environment reconstruction times for packaging range from 33 to 84 seconds, indicating practical viability for automated tasks.
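The Penetration Memory Tree is mentioned only as a mechanism for prioritising exploitation paths. One speculative reading, best-first search over a tree of candidate attack steps, is sketched below; the structure, names and scores are all assumptions:

```python
# Speculative sketch of a Penetration Memory Tree: nodes carry a
# priority score so the agent explores promising paths first.
# Structure and scoring are assumptions, not the paper's design.
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class PathNode:
    priority: float                            # lower value = explore first
    action: str = field(compare=False)         # e.g. "exploit CVE-... on host A"
    children: list["PathNode"] = field(default_factory=list, compare=False)

def next_actions(root: PathNode, budget: int = 3) -> list[str]:
    """Best-first walk over the tree, yielding the most promising steps."""
    frontier, chosen = [root], []
    while frontier and len(chosen) < budget:
        node = heapq.heappop(frontier)
        chosen.append(node.action)
        for child in node.children:
            heapq.heappush(frontier, child)
    return chosen

root = PathNode(0.0, "recon 10.0.7.0/24", children=[
    PathNode(0.2, "exploit CVE-2023-46604 on 10.0.7.15"),
    PathNode(0.8, "brute-force ssh on 10.0.7.20"),
])
print(next_actions(root))  # recon first, then the most promising exploit
```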
Limitations
TermiAgent has limited capability in complex web-based pentesting that requires interaction with HTML elements, file uploads or in-depth response analysis. The Arsenal Module cannot yet handle some very large monolithic code bases, highly fragmented PoCs, or PoCs that require interactive external network services. TermiBench focuses on initial ownership of a target rather than post-exploitation stages such as privilege escalation or lateral movement. Access to TermiAgent and the Arsenal Module is restricted, and some components are not publicly released. All experiments were conducted in isolated environments with ethical safeguards.
Why It Matters
TermiBench provides a realistic, open standard for evaluating autonomous pentesting agents in multi-service environments, enabling rigorous benchmarking and risk assessment of AI-driven security tooling. TermiAgent demonstrates how memory management and structured exploit packaging can improve reliability and efficiency, making end-to-end automated pentesting feasible on consumer hardware. The work also highlights dual-use risks: automated attack agents can lower the barrier to compromise, stressing the need for defensive strategies, governance and responsible use. The results show that real-world realism in evaluation matters for trustworthy AI security tooling and can inform how organisations design testing regimes and plan security workforce needs.