AI agents match pen testers, expose new risks
Pentesting
Researchers ran the first comprehensive live comparison of autonomous AI agents and human penetration testers on a large university network of roughly 8,000 hosts across 12 subnets. The multi-agent scaffold ARTEMIS performed near the top, placing second overall with nine validated vulnerabilities and an 82% valid-submission rate. Some ARTEMIS variants ran at about $18 per hour, compared with roughly $60 per hour for human testers in the same exercise.
What the test shows
ARTEMIS is a three-part design: a supervisor that plans, a swarm of dynamic sub-agents that run enumeration and exploitation tasks, and a triage module that verifies findings before submission. The framework demonstrates practical strengths you need to recognise: systematic enumeration at scale, parallel exploitation of many targets, and cost efficiency over short engagements. Existing agent scaffolds such as Codex and CyAgent generally underperformed relative to most human participants, while ARTEMIS matched the technical depth of the strongest human performers.
That performance comes with visible limits. The agents produced more false positives than humans and struggled with GUI-driven tasks. In one example the agents exploited a CLI path to an old iDRAC server that a GUI-focused tester missed. Agents also differed in pacing: some signalled completion after a short, intense run while humans spread effort across more targets. The test window was compressed and defenders were aware of the exercise, so results are not a perfect proxy for long, stealthy operations.
Diagram in words: supervisor -> sub-agents (scan, enumerate, exploit, validate) -> triage module -> submission. That flow maps directly onto operational risk: if an attacker reproduces the pipeline, the speed and scale of discovery rise dramatically.
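The paper's code is not reproduced here, so the sketch below is only an illustration of that supervisor -> sub-agent -> triage shape; every class, field and stubbed tool call is an assumption, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    host: str
    title: str
    evidence: str

@dataclass
class Task:
    kind: str    # "scan", "enumerate", or "exploit"
    target: str

class SubAgent:
    """Illustrative worker: runs one task and may emit candidate findings."""
    def run(self, task: Task) -> list:
        # A real sub-agent would drive tools (nmap, curl, exploit modules) from a
        # dynamically generated prompt; this stub just emits a placeholder candidate.
        if task.kind == "exploit":
            return [Finding(task.target, "candidate issue", "raw tool output")]
        return []

class Triage:
    """Gate: only in-scope, reproducible findings reach the submission channel."""
    def __init__(self, scope):
        self.scope = scope
    def validate(self, finding: Finding) -> bool:
        reproducible = bool(finding.evidence)      # placeholder re-test
        return finding.host in self.scope and reproducible

class Supervisor:
    """Plans a task list, fans work out to sub-agents, triages the results."""
    def __init__(self, targets, scope):
        kinds = ("scan", "enumerate", "exploit")
        self.tasks = [Task(kind, target) for target in targets for kind in kinds]
        self.triage = Triage(scope)
    def run(self):
        submitted = []
        for task in self.tasks:                    # ARTEMIS runs sub-agents in parallel
            for finding in SubAgent().run(task):
                if self.triage.validate(finding):
                    submitted.append(finding)      # -> submission
        return submitted

if __name__ == "__main__":
    print(Supervisor(["10.0.1.5"], scope={"10.0.1.5"}).run())
```

The part defenders should mirror is the triage gate: nothing reaches the submission channel without a validation step.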
Practical risks and mitigations
Two realities matter for defenders. First, the same agent design that helps defenders can be used by attackers to automate discovery and exploitation. Second, imperfect agent outputs mean noisy alerts and wasted analyst time unless you gate them. Put bluntly, you need robust human-in-the-loop review and technical controls before you let any autonomous offensive tool loose inside your estate.
Mitigations you can apply immediately include stepwise containment, strict validation, and improved automation hygiene. Contain automated testing to isolated networks or short-lived lab VMs. Require a named human reviewer for any submission before it propagates to ticketing or remediation workflows. Log every agent action with replayable context, and store prompts and intermediate outputs for audit.
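One simple way to enforce the named-reviewer rule is to make the ticketing hand-off refuse anything without a sign-off. A minimal sketch, assuming findings are passed around as dictionaries and that `create_ticket` is whatever function raises the ticket in your workflow (field names and the callback are illustrative):

```python
def forward_to_ticketing(finding: dict, create_ticket) -> bool:
    """Refuse to propagate any agent finding without a named human sign-off."""
    required = ("title", "affected_host", "reproduction_steps", "reviewed_by")
    missing = [key for key in required if not finding.get(key)]
    if missing:
        # Park the finding for human triage instead of raising a ticket.
        print(f"held for review, missing: {missing}")
        return False
    create_ticket(finding)   # only reached after a named reviewer signs off
    return True
```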
Short checklist:
- Enforce human review and reproducibility before remediation actions.
- Run agents inside strict containment and credential vaults; never expose production secrets.
- Collect comprehensive logs, prompts, and artefacts for triage and audit.
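For the logging item, an append-only JSON-lines record of every agent step (prompt, tool command, raw output) is usually enough to replay and audit a run. A minimal sketch, with illustrative field names and file path:

```python
import hashlib
import json
import time
from pathlib import Path

LOG = Path("agent_audit.jsonl")   # append-only; ship a copy to your SIEM as well

def log_step(agent_id: str, prompt: str, command: str, output: str) -> None:
    """Record one agent action with enough context to replay it later."""
    entry = {
        "ts": time.time(),
        "agent": agent_id,
        "prompt": prompt,
        "command": command,
        "output": output,
        # The hash lets you show the stored artefact has not been altered since capture.
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    }
    with LOG.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

# Example: log_step("subagent-3", "enumerate host 10.0.1.5", "nmap -sV 10.0.1.5", scan_output)
```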
The study is cautiously optimistic: autonomous agents can approach human performance on many tasks, but they are not ready to replace skilled analysts. Expect improved tooling to close some gaps, and expect attackers to adopt the same ideas. Your ops playbook should prioritise validation, logging, prompt safety and containment. Treat these agents like powerful new tools that demand additional process and oversight rather than instant efficiency gains.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing
🔍 ShortSpan Analysis of the Paper
Problem
The study conducts the first comprehensive live comparison between AI agents and cybersecurity professionals in a real enterprise environment. The target is a large university network encompassing about 8,000 hosts across 12 subnets, with the aim of evaluating whether autonomous AI agents can perform real world penetration testing with technical depth, systematic enumeration, parallel exploitation, and cost advantages, while identifying current gaps that limit deployment in production security workflows.
Approach
Ten cybersecurity professionals were tested alongside six existing AI agents and ARTEMIS, a new agent scaffold, in a live production environment. ARTEMIS is a multi-agent framework comprising a supervisor, a swarm of dynamic sub-agents, and a triage module for vulnerability verification. The study employed a unified scoring framework combining technical complexity and vulnerability criticality, aligned with industry standards and MITRE ATT&CK techniques. Two ARTEMIS configurations were evaluated: A1 with GPT-5 for the supervisor and sub-agents, and A2 with an ensemble of supervisor models and a mix of sub-agents. All agents used the same university VM as human participants and operated under safeguards including real-time monitoring, scope controls, and reporting channels. The engagement spanned up to 16 hours for ARTEMIS and up to 10 hours for human participants. Data collection included structured submissions, session logs, and a detailed vulnerability triage process to reduce noise and false positives. The environment incorporated standard risk controls, and participants provided informed consent with monitoring and de-identification measures in place. ARTEMIS used a three-component design with a task list and dynamically generated prompts for sub-agents, and a triage module to validate relevance, reproducibility, and severity before submission.
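The paper describes the scoring framework as a combination of technical complexity and vulnerability criticality, but the exact weighting is not reproduced in this summary. The sketch below is therefore only an illustrative composite score; the weights and the 0-10 scale are assumptions, not the authors' published values.

```python
def composite_score(complexity: float, criticality: float,
                    w_complexity: float = 0.5, w_criticality: float = 0.5) -> float:
    """Illustrative unified score combining technical complexity and criticality.

    Both inputs are assumed to be normalised to a 0-10 scale; the 50/50 weighting
    is a placeholder, not the paper's actual scheme.
    """
    for value in (complexity, criticality):
        if not 0 <= value <= 10:
            raise ValueError("scores must be on a 0-10 scale")
    return w_complexity * complexity + w_criticality * criticality

# Example: a low-complexity but highly critical finding
# print(composite_score(complexity=3, criticality=9))  # -> 6.0
```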
Key Findings
- ARTEMIS placed second overall, discovering nine valid vulnerabilities with an 82 percent valid submission rate, outperforming nine of ten human participants.
- Existing agent scaffolds such as Codex and CyAgent generally underperformed relative to most human participants, while ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest performers.
- AI agents showed advantages in systematic enumeration, parallel exploitation, and cost. Some ARTEMIS variants cost around $18 per hour compared with about $60 per hour for professional testers.
- Key capability gaps emerged for AI agents: higher false-positive rates and difficulty with GUI-based tasks, though CLI-based interactions allowed some exploitation opportunities that GUI-oriented humans missed, such as an older iDRAC server vulnerability exploited with curl.
- Across participants, 49 total validated vulnerabilities were found, with the number per participant ranging from three to thirteen. Human findings tended to be dispersed across targets, while ARTEMIS often signalled completion after a shorter active period, reflecting differences in pacing and focus between humans and agents.
- Two ARTEMIS configurations were able to match or exceed human performance on many tasks, but gaps remained in long-horizon planning and high-level strategic pivoting compared with human analysts. ARTEMIS nonetheless demonstrated strong long-horizon capacity by maintaining multiple concurrent sub-agents while summarising progress for ongoing work.
- Cost to performance was a practical differentiator: A1 incurred $291.47 in total cost (around $18.21 per hour) for the engagement, while A2 cost $944.07 (about $59 per hour), with similar vulnerability counts; this places ARTEMIS as cost-competitive with conventional penetration testers when scaled over time.
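The per-hour figures follow directly from those totals over the 16-hour engagement window reported for ARTEMIS; a quick check, assuming the full window was used:

```python
# Sanity check of the reported ARTEMIS costs (USD) against the 16-hour window.
ENGAGEMENT_HOURS = 16
totals = {"A1": 291.47, "A2": 944.07}

for config, total in totals.items():
    print(f"{config}: ${total / ENGAGEMENT_HOURS:.1f} per hour")
# A1: $18.2 per hour, A2: $59.0 per hour, versus roughly $60 per hour for a human tester
```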
Limitations
The study had a compressed time frame with human participants limited to ten hours of active engagement and four days of system access, whereas real world pen tests typically run longer. Authentic defensive conditions were not present: IT staff were aware of the test and could intervene to prevent disruption. The sample size was limited, reducing statistical power for broad generalisation. The authors acknowledge the need for runnable environment replicas, ablation studies over configurations and models, and extended logging to support longer, reproducible evaluations. Safety measures included real time monitoring and containment controls to prevent scope creep or harmful actions.
Why It Matters
The findings indicate that autonomous AI agents can conduct real world security assessments with systematic enumeration, parallel exploitation, and cost advantages, approaching the performance of skilled human practitioners on many tasks. However, substantial gaps remain in false-positive management and GUI interactions, highlighting where safe deployment and further research are required. The results also raise concerns about dual use and the potential for attackers to leverage similar autonomous AI agents for rapid breaches; they emphasise the importance of human-in-the-loop oversight, robust validation, and rigorous auditing. Practical mitigations include strong validation of AI outputs, enhanced GUI and automation reliability, comprehensive logging, safe prompt design, and strict containment controls to govern autonomous operation.
Societal and security implications include a potential automation shift in security work and evolving defender-attacker dynamics. Governance, safety rails, and responsible deployment are essential as AI-enabled security tools become more capable and accessible. The authors advocate open access to ARTEMIS-style tooling to support defender access while continuing to address the gaps identified in autonomous offensive security research.