
LLM Agents Tackle Lateral Movement, Still Brittle

Pentesting
Published: Fri, May 08, 2026 • By Marcus Halden
A new study puts language model agents to work on lateral movement in a Windows Active Directory testbed. Expert-supplied plans beat autonomous runs, but failure and waste are common. The standout finding: clever scaffolding helps, yet command brittleness and messy credential handling still derail end-to-end success.

Red teams are starting to hand the clipboard to Large Language Model (LLM) agents and ask a blunt question: can they actually run multi-step intrusions without babysitting? This paper takes that challenge seriously, focusing on lateral movement in a controlled Windows Active Directory environment and mapping the work to MITRE ATT&CK so the units of progress are clear.

How they tested it

The authors model two lateral-movement scenarios as ordered task chains, one with nine tasks and one with ten. Each task has explicit validation predicates. An LLM-as-a-Judge verifies outcomes, so an agent does not get credit for hand-waving; it must produce artefacts the judge can check. I like this a lot: it turns vague agent success into yes or no based on evidence.
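
To make that concrete, here is a minimal sketch in Python of what an ordered task chain with explicit validation predicates might look like. The paper does not publish code, so the class shape, task descriptions and predicates are my own illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch only: the paper does not publish code, so the class,
# field names and predicates here are assumptions about how an ordered task
# chain with explicit validation predicates could be represented.

@dataclass
class Task:
    task_id: int
    description: str
    # The validation predicate inspects collected telemetry and returns True
    # only when verifiable evidence of completion is present.
    validate: Callable[[dict], bool]

# Hypothetical fragment of a nine-task lateral-movement chain.
scenario_1 = [
    Task(1, "Enumerate domain hosts",
         lambda t: len(t.get("discovered_hosts", [])) > 0),
    Task(2, "Harvest local credentials",
         lambda t: bool(t.get("credential_artifacts"))),
    Task(3, "Authenticate to target host",
         lambda t: t.get("auth_success") is True),
]

def judge(task: Task, telemetry: dict) -> bool:
    """Deterministic yes/no verdict: credit only when the predicate is
    satisfied by checkable artefacts, never by agent claims alone."""
    return task.validate(telemetry)

print(judge(scenario_1[0], {"discovered_hosts": ["DC01", "WS02"]}))  # True
```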

They split responsibilities across three roles: an Orchestrator agent for planning, a Cyber Agent that issues commands, and a Judge agent that evaluates telemetry. Runs compare three modes: expert-defined action plans, self-scaffolded plans built by the agent, and fully autonomous operation. Experiments span five different LLMs under a fixed token budget with instrumentation for retries, premature task jumps and looping.
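
A rough sketch of how those three roles and a fixed token budget could fit together in one control loop. The role split follows the paper's description; the interfaces, method names and budget figure are assumptions, not the authors' implementation:

```python
# Hedged sketch: the orchestrator / cyber agent / judge split follows the
# paper, but every interface, method name and the budget figure below is an
# assumption, not the authors' implementation.

TOKEN_BUDGET = 200_000  # hypothetical fixed per-run budget

def run_scenario(tasks, orchestrator, cyber_agent, judge, mode="expert"):
    """mode: 'expert' (predefined plan), 'self_scaffolded' or 'autonomous'."""
    tokens_used = 0
    retries = 0
    plan = orchestrator.plan(tasks, mode)              # planning role
    completed = 0
    for task in plan:
        verified = False
        while not verified and tokens_used < TOKEN_BUDGET:
            command = cyber_agent.next_command(task)   # execution role
            telemetry, cost = cyber_agent.execute(command)
            tokens_used += cost
            verified = judge.verify(task, telemetry)   # evaluation role
            if not verified:
                retries += 1                           # instrumented retry
        if not verified:
            break                                      # budget exhausted
        completed += 1
    return {"completed": completed, "tokens_used": tokens_used,
            "retries": retries}
```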

What broke, what worked

Expert-defined plans were the most reliable. In this mode, Claude Sonnet 4.5, GPT-5.1 and Claude Opus 4.5 each completed the full nine-task chain. Gemini 3 Pro stalled at task three; DeepSeek v3.2 failed at the first step. The result fits a pattern many of us have seen: give agents a strong scaffold and they do real work; remove it and variance explodes.

Failures clustered around three pain points. First, brittle command invocation and poor execution context hygiene. Second, identity and credential transitions that went sideways, especially when moving between accounts or machines. Third, ambiguous verification where partial telemetry looked like success. The traces will feel familiar: recurrent PowerShell use, LSASS access attempts, Pass-the-Hash, and writable share abuse.
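
The first two pain points are, at bottom, execution-context hygiene. Here is a small hypothetical guard, with all names invented, showing the kind of discipline an agent harness would need so that no command runs over an unverified identity transition:

```python
from dataclasses import dataclass, field

# Hypothetical guard, not the paper's code: every command runs under an
# explicit session whose identity has been positively verified, instead of
# assuming the credential from the previous step still applies.

@dataclass
class Session:
    host: str
    username: str
    credential_verified: bool = False   # set True only after a checked logon
    history: list = field(default_factory=list)

def run_as(session: Session, command: str) -> str:
    """Refuse to execute over an unverified identity transition."""
    if not session.credential_verified:
        raise RuntimeError(
            f"identity {session.username}@{session.host} not verified; "
            "re-authenticate before executing"
        )
    session.history.append(command)
    return f"[{session.username}@{session.host}] {command}"

s = Session(host="WS02", username="svc_backup")
s.credential_verified = True            # would follow a successful logon check
print(run_as(s, "whoami /groups"))
```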

The cost of getting stuck was not subtle. One environment setup task needed four attempts. A reverse-shell deployment took nine attempts, burning about 11 minutes and roughly 53k tokens. In a fully autonomous run, a credential-dumping step chewed through roughly 260k tokens without producing verifiable artefacts. Agents also advanced on weak signals, reusing bad credentials or assuming privilege that the judge had not granted, yielding end states that looked tidy but lacked the right intermediate evidence.
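
That last failure mode is essentially a missing prerequisite gate. A minimal sketch of such a gate, assuming a judge that records a verified verdict per task; none of this is taken from the paper's implementation:

```python
# Minimal sketch with assumed names: the agent may not start task N until
# the judge has recorded a verified pass for task N-1, which blocks the
# "reuse bad credentials and assume privilege" shortcut described above.

verdicts: dict[int, bool] = {}      # judge-recorded results keyed by task id

def record_verdict(task_id: int, passed: bool) -> None:
    verdicts[task_id] = passed

def may_start(task_id: int) -> bool:
    """Task 1 has no prerequisite; every later task needs a verified pass
    for its predecessor, not merely output that looks plausible."""
    return task_id == 1 or verdicts.get(task_id - 1, False)

assert may_start(1)                 # first task is always allowed
record_verdict(1, True)
assert may_start(2)                 # predecessor verified, proceed
record_verdict(2, False)
assert not may_start(3)             # blocked: task 2 was never actually met
```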

Methodologically, the LLM-as-a-Judge approach is the star here. It corrals hand-wavy agent claims into deterministic checks, even if the verdicts get noisy when telemetry is partial. The authors also report best runs per configuration, so the headline numbers are an upper bound. That is fine by me: it sets a realistic ceiling today and a clear benchmark to beat.
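
One way to picture the noise caveat: a judge can refuse to rule when required artefacts are missing rather than guessing. The artefact names and three-way verdict below are assumptions of mine, not the paper's design:

```python
from enum import Enum

# Hedged sketch: the artefact names and the three-way verdict are my
# assumptions, illustrating how a judge can avoid crediting an agent when
# telemetry is merely partial rather than positively confirming success.

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    INSUFFICIENT = "insufficient_telemetry"

REQUIRED_ARTIFACTS = {"process_log", "auth_event", "dump_file_hash"}

def judge_credential_dump(telemetry: dict) -> Verdict:
    missing = REQUIRED_ARTIFACTS - telemetry.keys()
    if missing:
        return Verdict.INSUFFICIENT   # partial logs: do not guess either way
    if telemetry["auth_event"] == "success" and telemetry["dump_file_hash"]:
        return Verdict.PASS
    return Verdict.FAIL

print(judge_credential_dump({"process_log": "..."}))  # Verdict.INSUFFICIENT
```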

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Autonomous Adversary: Red-Teaming in the age of LLM

Authors: Mohammad Mamun, Mohamed Gaber, Scott Buffett, and Sherif Saad
Language Model Agents (LMAs) are emerging as a powerful primitive for augmenting red-team operations. They can support attack planning, adversary emulation, and the orchestration of multi-step activity such as lateral movement, a core enabling capability of advanced persistent threat (APT) campaigns. Using frameworks such as MITRE ATT&CK, we analyze where these agents intersect with core offensive functions and assess current strengths and limitations of LMAs with an emphasis on governance and realistic evaluation. We benchmark LMAs across two lateral-movement scenarios in a controlled adversary-emulation environment, where LMAs interact with instrumented cyber agents, observe execution artifacts, and iteratively adapt based on environmental feedback. Each scenario is formalized as an ordered task chain with explicit validation predicates, leveraging an LLM-as-a-Judge paradigm to ensure deterministic outcome verification. We compare three operational modalities: fully autonomous execution, self-scaffolded planning, and expert-defined action plans. Preliminary findings indicate that expert-defined action plans yield higher task-completion rates relative to other operational modes. However, failure remains frequent across all modalities, largely attributable to brittle command invocation, environmental and deployment instability, and recurring errors in credential management and state handling.

🔍 ShortSpan Analysis of the Paper

Problem

The paper examines how Language Model Agents (LMAs) can augment offensive cyber operations, focusing on lateral movement, a multi-step capability central to advanced persistent threat activity. It asks how reliably LMAs can plan, execute and verify chained intrusion steps in a realistic enterprise Active Directory environment, and what failure modes and governance challenges arise when agents act autonomously or with varying degrees of scaffolding.

Approach

The authors implement a controlled multi-host Windows Active Directory testbed and formalise two lateral‑movement scenarios as ordered task chains with explicit verification predicates (Scenario‑1: nine tasks; Scenario‑2: ten tasks). A three-agent architecture separates planning (Orchestrator LMA), execution (Cyber Agent) and evaluation (Judge LMA using an LLM-as-a-Judge). They compare three operational modalities: expert-defined action plans, self-scaffolded planning by the agent, and fully autonomous operation. Experiments ran across five large language models under a fixed token budget per run and instrumented feedback to measure partial progress, retries and behavioural signals such as premature progression and loss‑of‑control loops.

Key Findings

  • Expert-defined plans produced the highest task‑completion rates: Claude Sonnet 4.5, GPT‑5.1 and Claude Opus 4.5 each completed the full nine‑task chain in the expert‑defined mode, while Gemini 3 Pro stalled at task 3 and DeepSeek v3.2 stalled at task 1.
  • Failures remained frequent across modes and models, driven by three recurring bottlenecks: unreliable command invocation and execution context maintenance, credential and identity transition errors, and verification ambiguity where partial telemetry is misinterpreted as success.
  • Agent operation often incurred large resource burn through repeated low‑level retries and looping behaviour. Examples include an environment setup task that required four attempts, a reverse‑shell deployment that took nine attempts consuming 11 minutes and 53k tokens, and a credential‑dumping task in a fully autonomous run that expended about 2.6×10^5 tokens without yielding verifiable artifacts.
  • Self‑scaffolded runs showed high variance: some completed nearly all proposed tasks while others failed early; fully autonomous runs exhibited still greater variance in both success count and token efficiency, illustrating that task success alone does not capture operational quality.
  • Agents sometimes advanced on weak or unverified signals, reusing invalid credentials or assuming privileged access despite judge verdicts marking prerequisite tasks unmet, producing end states that appear successful but lack correct intermediate evidence.

Limitations

The evaluation reflects upper‑bound performance by reporting best runs per configuration and uses a single attempt per run under a fixed token limit. Judge LMA verdicts can be noisy when telemetry is partial, and model behaviour and failure modes vary by implementation and trigger conditions. The testbed emphasises post‑compromise lateral movement and does not cover initial access vectors in the same depth. Results depend on the specific models and tooling interfaces evaluated and may not generalise to all deployments.

Implications

Offensive implications are twofold: LMAs can automate multi‑step adversary emulation and scale lateral‑movement workflows, reducing the labour and coordination costs of red teams or attackers. With adequate scaffolding, current LMAs can execute complex kill chains end‑to‑end, but operational brittleness remains exploitable; attackers may waste resources in retry loops or produce unreliable access when credential handling fails. The traces produced by LMAs—recurrent PowerShell execution, LSASS access attempts, Pass‑the‑Hash activity and writable‑share abuse—align with standard EDR and SIEM primitives and could be generated at scale by automated agents during intrusions or red‑team exercises.
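
As an illustration of how those trace types translate into defender-side checks, the sketch below flags the indicator families listed above. The event schema, field names and heuristics are assumptions for illustration, not drawn from the paper or from any specific EDR or SIEM:

```python
# Illustrative only: the event fields and heuristics are assumptions, not
# the schema of any particular EDR or SIEM product, but they show how the
# trace mix described above maps onto checks defenders already run.

SUSPICIOUS = {
    "powershell_encoded": lambda e: e.get("process") == "powershell.exe"
                                    and "-enc" in e.get("cmdline", "").lower(),
    "lsass_access":       lambda e: e.get("target_process") == "lsass.exe"
                                    and e.get("access") == "read_memory",
    "pass_the_hash":      lambda e: e.get("logon_type") == 9
                                    and e.get("auth_package") == "NTLM",
    "writable_share_write": lambda e: e.get("action") == "file_write"
                                      and e.get("path", "").startswith("\\\\"),
}

def indicator_families(events: list[dict]) -> set[str]:
    """Return which indicator families fire for one host's event stream."""
    return {name for name, match in SUSPICIOUS.items()
            for e in events if match(e)}

hits = indicator_families([
    {"process": "powershell.exe", "cmdline": "powershell -EncodedCommand ..."},
    {"target_process": "lsass.exe", "access": "read_memory"},
])
print(sorted(hits))  # ['lsass_access', 'powershell_encoded']
```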

