
RCR shows LLMs assist Active Directory pentests

Pentesting
Published: Wed, Mar 04, 2026 • By Clara Nyx
A replication report tests whether large language models can autonomously pentest Microsoft Active Directory in a controlled lab. The prototype often finds actionable leads and near-misses but struggles to close the loop. Heavy setup aside, the release gives teams a reproducible way to study AI-assisted internal attacks.

Another week, another claim that large language models can hack your network. This one at least brings tooling and traces you can check. The Replicated Computational Results report for “Can LLMs Hack Enterprise Networks?” packages a full lab and replayable logs to test whether a Large Language Model (LLM) can drive an assumed‑breach engagement against Microsoft Active Directory.

The team built a Python prototype called cochise that runs from a Kali Linux attacker virtual machine against A Game of Active Directory (GOAD), a public five‑host Windows lab. Everything sits under VMware on commodity server kit, but do not expect to run this on a laptop. The authors cite a practical floor of about 48 GB of RAM and roughly 190 GB of disk for the virtual machines, plus evaluation Microsoft licences, and you rebuild the environment after destructive runs.

The evaluation is methodical. It exercises several model setups, including OpenAI GPT‑4o, DeepSeek‑V3, Google Gemini‑2.5‑Flash, a planner and executor pairing using OpenAI o1 with GPT‑4o, and Alibaba Qwen3. Sessions run up to two hours with temperature set to zero when allowed. The run continues until saturation, defined as two consecutive samples that uncover no new leads or compromised accounts. Every prompt and command lands in timestamped JSON for later replay, and the authors ship Docker images, code, example logs and analysis scripts on public archives.
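The report does not reproduce the exact log schema here, but a replay‑friendly trace of prompts and commands is easy to picture. The sketch below is an illustrative reconstruction: the field names (`ts`, `kind`, `payload`) and function names are assumptions, not cochise's actual format.

```python
import json
from datetime import datetime, timezone

def log_event(path, kind, payload):
    """Append one timestamped JSON event (prompt, response, or shell command)
    to a newline-delimited log file. Field names are illustrative assumptions."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": kind,          # e.g. "llm_prompt", "llm_response", "shell_command"
        "payload": payload,
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(event) + "\n")
    return event

def replay(path, kind=None):
    """Yield logged events in order, optionally filtered by kind,
    so a run can be analysed offline without live network access."""
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)
            if kind is None or event["kind"] == kind:
                yield event
```

A structure like this is what makes the "analyse without live network access" claim work: every decision the model made is replayable after the lab has been torn down.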

What actually worked

The headline is not that models waltz through an enterprise. The prototype can chain together non‑trivial Active Directory actions and often surfaces “actionable leads”, but many attempts stall a whisker from success. Small slips in targeting or procedure regularly turn what should be a straightforward step into a dead end. The study classifies a lot of these as “Almost‑There”. The primary metric is hard‑edged: the number of fully compromised user accounts. That is where the brittleness shows.

This is both encouraging and sobering. Encouraging because the logs show real, automatable behaviour security teams can analyse without live network access. Sobering because the distance between almost there and domain compromise is not wide in many environments. If models can reliably reduce toil on enumeration, path discovery and basic execution, an operator, benign or malicious, can supply the missing judgement.

There is useful engineering here: a repeatable workflow, consistent caps on time, and a saturation rule that avoids cherry‑picking heroic single runs. The authors note that six samples per configuration were enough to hit saturation in their experiments, which should help future comparisons.

Now the caveats. This is one lab, GOAD, not your estate. It is invasive and must be rebuilt after runs. Some dependencies are finicky and tied to specific versions. Qualitative judgements lean on expert reviewers. The results do not generalise to every Active Directory topology, politics or patch level. Treat it as a wind‑tunnel test, not a street race.

Does it matter? Yes, in a measured way. The package shows that modern LLMs can materially assist internal attack automation under constraints, and it gives defenders a way to study that assistance with real traces. It also reminds vendors and security teams that policy, monitoring and safe‑use controls need to assume AI‑assisted attack workflows. The open question is reliability. Until these systems stop tripping on small variations, they are accelerants rather than autonomous intruders. That is still worth planning for.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Can LLMs Hack Enterprise Networks? -- Replicated Computational Results (RCR) Report

Authors: Andreas Happe and Jürgen Cito
This is the Replicated Computational Results (RCR) Report for the paper “Can LLMs Hack Enterprise Networks?” The paper empirically investigates the efficacy and effectiveness of different LLMs for penetration-testing enterprise networks, i.e., Microsoft Active Directory Assumed-Breach Simulations. This RCR report describes the artifacts used in the paper, how to create an evaluation setup, and highlights the analysis scripts provided within our prototype.

🔍 ShortSpan Analysis of the Paper

Problem

This report documents the replication package and evaluation procedures for a study that asks whether large language models (LLMs) can autonomously perform penetration testing against enterprise Microsoft Active Directory networks. The question matters because automating assumed-breach simulations could reduce the cost of security testing for organisations, while also exposing dual-use risks if LLMs can meaningfully assist attackers.

Approach

The authors built an autonomous penetration-testing prototype named cochise that drives attacks from a Kali Linux attacker VM against a publicly available third-party testbed called A Game of Active Directory (GOAD). The environment uses five Windows virtual machines configured as an enterprise network and a Kali VM as attacker origin. Tests ran on commodity server hardware with multiple VMs under VMware, requiring a practical minimum of about 48 GB of RAM and roughly 190 GB of disk. cochise is Python-based, logs every LLM interaction and executed command as timestamped JSON events, and can be run via provided Docker images. The evaluation tested five LLM configurations hosted on cloud providers or rented hardware: OpenAI GPT-4o, DeepSeek-V3, Google Gemini-2.5-Flash, a two-model Planner/Executor pairing using OpenAI o1 and GPT-4o, and Alibaba Qwen3. Experiments used a two-hour time cap per sample, temperatures were generally set to zero when possible, and runs continued until saturation — defined as two consecutive samples producing no new leads or compromised accounts. The primary quantitative success metric was the number of fully compromised user accounts; qualitative assessment captured actionable leads and near-miss behaviours. All code, JSON log data, and analysis scripts are released on GitHub and as a Zenodo artifact; replay and analysis tools are provided to inspect traces without network access.
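The saturation rule described above is a simple stopping condition: keep sampling until the last two samples uncover nothing new. The sketch below is a hypothetical reconstruction of that logic, not the prototype's actual code; it treats each sample as the set of findings (leads plus compromised accounts) it produced.

```python
def saturated(sample_results, window=2):
    """Return True when the last `window` consecutive samples uncovered no
    findings beyond what earlier samples had already produced.

    sample_results: list of sets, one set of findings per sample, in run order.
    """
    if len(sample_results) < window:
        return False
    # Everything known before the trailing window began.
    seen = set().union(*sample_results[:-window]) if sample_results[:-window] else set()
    for findings in sample_results[-window:]:
        if findings - seen:     # this sample still produced something new
            return False
        seen |= findings
    return True
```

Under a fixed time budget per sample, a rule like this guards against cherry-picking a single lucky run while still bounding total cost; the authors report it triggered after six samples per configuration.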

Key Findings

  • The prototype can autonomously conduct complex Active Directory penetration tests and produces detailed logs suitable for replay and analysis.
  • LLMs frequently produced actionable leads and near-success attacks but often failed due to small procedural or targeting variations; many attempts were classified as "Almost-There" rather than fully successful.
  • The evaluation reached saturation after six samples per configuration in the authors' experiments, enabling systematic comparison across models under a fixed time budget.
  • Reasoning-capable models and hybrid Planner/Executor setups were included to probe different capabilities, and the setup supports both cloud-hosted and locally hosted LLMs.
  • All experimental artefacts, including preconfigured Docker images, example log files, and analysis scripts, are provided to facilitate replication and further study.

Limitations

Reproducing the setup requires nontrivial resources and configuration effort: GOAD depends on evaluation Microsoft licenses and multiple VMs that consume substantial RAM and disk. The testbed is invasive and should be rebuilt after each destructive run. Results are specific to the GOAD scenario and the selected LLM configurations; qualitative judgements relied on expert reviewers. Some tooling and dependencies required specific versions and host provisioning steps, and availability of certain third-party components changed during the work.

Why It Matters

The replication package demonstrates that modern LLMs can materially assist in automating complex internal-network attacks and provides a reusable workflow to evaluate this capability. This carries practical security implications: organisations and LLM providers should consider defensive controls, monitoring and safe-use policies to address potential AI-assisted threat scenarios. The released artifacts enable security teams and researchers to compare models, reproduce results, and develop mitigations.

