Specialised LLM agents outgun generalists on web injections
Agents
Web apps now ship at the speed of slide decks, fuelled by no-code builders and AI-assisted development. Security tooling has not kept pace. Signature scanners miss context. Large Language Model (LLM) agents wander, rack up bills, and fail to reproduce results. Into that gap steps AWE, a multi-agent framework that opts for discipline over drift.
AWE packages vulnerability-specific pipelines for classes like cross-site scripting (XSS) and SQL injection, then lets a lightweight LLM act as an adviser for plan selection and payload synthesis. The system carries state through persistent memory, probes targets with browser-backed verification, and aims for evidence before it shouts ‘vulnerability’. The pitch is simple: less exploration theatre, more exploitation proof.
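That division of labour, a deterministic probe loop with the LLM demoted to adviser, can be sketched roughly as follows. All class and function names here are illustrative, not from the paper's implementation; the probe is a stub standing in for a real target.

```python
# Illustrative sketch of a vulnerability-specific pipeline in the AWE style:
# the agent runs a fixed, deterministic probe loop, consults an adviser only
# to rank candidate payloads, and records what it has already tried.
from dataclasses import dataclass, field

@dataclass
class Finding:
    payload: str
    evidence: str  # what the verification step observed

@dataclass
class XSSPipeline:
    tried: set = field(default_factory=set)      # persistent memory across probes
    findings: list = field(default_factory=list)

    def run(self, probe, advise, candidates, budget=20):
        """probe(payload) -> observed reflection; advise(payloads) -> ranked list."""
        for payload in advise(candidates)[:budget]:
            if payload in self.tried:            # never re-spend budget on a probe
                continue
            self.tried.add(payload)
            observed = probe(payload)
            if payload in observed:              # evidence first: verified reflection only
                self.findings.append(Finding(payload, observed))
        return self.findings

# Stub target: reflects input but strips <script> tags.
def probe(p):
    return p.replace("<script>", "")

pipeline = XSSPipeline()
hits = pipeline.run(
    probe,
    advise=lambda ps: sorted(ps, key=len),       # stand-in for the LLM adviser
    candidates=["<script>alert(1)</script>", "<img src=x onerror=alert(1)>"],
)
```

Only the event-handler payload survives the stub's filter, so only it is recorded as a finding; the blocked payload still lands in memory so it is never retried.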
On the 104-challenge XBOW benchmark, that approach pays off for injections. AWE reports 87% success on XSS and 66.7% on blind SQL injection, beating a general-purpose system, MAPTA, by 30.5 and 33.3 percentage points on those categories. It is also quicker and thriftier: average solve time 53.1 seconds versus 190.8, median 35.7 versus 156.2, and about 98% fewer tokens (1.12 million versus 54.9 million). API cost drops to $7.73 from $21.38 in the comparison, despite AWE using a mid-tier model (Claude Sonnet 4) and MAPTA running a more capable one (GPT-5). Architecture, not just model muscle, seems to matter.
There is a catch, and it is not small. Breadth. Overall, AWE solves 51.9% of XBOW where MAPTA lands 76.9%. The specialisation that nails injections narrows horizons elsewhere. The authors are frank about scope: AWE targets black-box injection flaws, not protocol-level, cryptographic or many business-logic bugs, and it does not yet chain long, multi-step attacks. Its context and filtering rely on heuristics that may wobble on idiosyncratic frameworks. Performance is also sensitive to model choice and pricing, which change in the real world without asking your permission.
Does this matter for defenders? If you care about XSS and SQLi in continuous testing, yes. AWE’s design reduces the LLM’s tendency to hallucinate findings by insisting on browser-verified effects and by remembering what it already tried. That produces more reproducible, exploitation-driven results than free-roaming agents, and at a fraction of the time and tokens. It will not replace broader exploratory systems, but it can out-deliver them where the target class is known and common.
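A minimal illustration of why execution-level checks curb hallucinated findings: substring matching flags any reflection, while a browser-style check only counts markup that would actually execute. The encoding behaviour and helper names below are assumptions for the sketch; a real harness would drive a headless browser rather than inspect strings.

```python
import html

# Stub endpoint: reflects input, but HTML-encodes it (a common sanitiser).
def render(payload):
    return f"<p>You searched for: {html.escape(payload)}</p>"

def substring_check(payload):
    # Naive check: "did part of my payload come back?" -> false positives.
    return "alert(1)" in render(payload)

def browser_check(payload):
    # Stand-in for browser-backed verification: encoded markup is inert,
    # so only unescaped <script> or event-handler attributes count.
    page = render(payload)
    return "<script>" in page or "onerror=" in page

payload = "<script>alert(1)</script>"
reflected = substring_check(payload)   # True: the substring came back
executed = browser_check(payload)      # False: the markup was encoded, inert
```

The gap between those two booleans is exactly the gap between a noisy report and a verified exploit.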
What to do with this
- Run specialised, vulnerability-aware agents alongside a generalist. Use the former to cut cost and noise on high-frequency injection checks.
- Pin model versions and log full evidence trails. ‘Deterministic’ means ‘more stable than before’, not ‘immutable’.
- Scope honestly: keep business-logic and long chains with tools that handle multi-step reasoning and sandboxed execution.
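Pinning and evidence logging, as the list above suggests, can be as simple as recording one replayable entry per finding. The record fields here are suggestions rather than any standard, and the version string is a hypothetical pin.

```python
import hashlib, json, time

def evidence_record(model, model_version, target, payload, observed):
    """One replayable evidence entry per finding: pin the model, hash the proof."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model,
        "model_version": model_version,   # pinned explicitly, never 'latest'
        "target": target,
        "payload": payload,
        "observed_sha256": hashlib.sha256(observed.encode()).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)

line = evidence_record(
    "claude-sonnet-4",                    # model family named in the paper
    "2025-05-14",                         # hypothetical version pin
    "https://staging.example/search",
    "<img src=x onerror=alert(1)>",
    "<p><img src=x onerror=alert(1)></p>",
)
```

Hashing the observed response keeps the trail compact while still letting you prove later that a rerun reproduced (or failed to reproduce) the same effect.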
Two open questions remain. First, how well do XBOW and DVWA-style targets track today’s frameworks and peculiar sanitisation stacks? Second, can teams maintain a stable of specialised pipelines as frameworks shift? Even with those caveats, the direction is right: less agent improv, more proof. Not a breakthrough, but a sensible correction.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
AWE: Adaptive Agents for Dynamic Web Penetration Testing
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies autonomous black-box web penetration testing in the context of rapidly evolving web development practices and AI-assisted deployment. Traditional signature-driven scanners struggle with novel sanitisation and application-specific contexts, while recent LLM-driven autonomous testers tend to explore unconstrainedly, producing high cost, unstable behaviour and poor reproducibility. This gap creates a need for systems that can reason about contextual input handling, maintain state across probes and produce verified, exploit-driven findings.
Approach
The authors present AWE, a memory-augmented multi-agent framework that embeds vulnerability-specific analysis pipelines inside a lightweight LLM orchestration layer. AWE has three layers: an Orchestration Layer that maintains a global exploitation context and controls resource budgets; a Specialized Agents Layer with dedicated pipelines for classes such as XSS, SQL injection, SSTI, command injection and IDOR; and a Foundation Layer providing persistent memory, hybrid payload generation, browser-backed verification and endpoint discovery. The LLM acts primarily as an adviser for plan selection and payload synthesis, while agents implement deterministic, evidence-driven procedures. Evaluation used the 104-challenge XBOW benchmark and controlled DVWA experiments, comparing AWE (using Claude Sonnet 4) to a state-of-the-art general-purpose system, MAPTA (using a more capable model and sandbox execution).
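The rough shape of that layered design, a budget-enforcing orchestrator dispatching to per-class agents over shared foundation services, might look like the sketch below. Every identifier is hypothetical; the paper names the layers, not this interface.

```python
# Hypothetical skeleton of the orchestration described above: a global
# context, a hard resource cap, and specialised agents consulted in turn.
class Orchestrator:
    def __init__(self, agents, token_budget):
        self.agents = agents              # e.g. {"xss": ..., "sqli": ..., "ssti": ...}
        self.token_budget = token_budget  # resource budget the layer enforces
        self.context = {}                 # global exploitation context, shared state

    def run(self, target):
        findings = []
        for name, agent in self.agents.items():
            if self.token_budget <= 0:    # hard cap: no unconstrained exploration
                break
            spent, result = agent(target, self.context)
            self.token_budget -= spent
            findings.extend(result)
        return findings

# Toy agents: each returns (tokens spent, findings) for its class.
orch = Orchestrator(
    agents={
        "xss":  lambda t, ctx: (300, [("xss", t)]),
        "sqli": lambda t, ctx: (500, [("sqli", t)]),
        "ssti": lambda t, ctx: (400, []),
    },
    token_budget=700,
)
found = orch.run("https://app.example/login")
```

With a 700-token budget the third agent never runs: the cap is checked before each dispatch, which is the behaviour that keeps cost bounded rather than exploratory.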
Key Findings
- AWE substantially improves success on targeted injection classes: reported XSS success of 87% (a 30.5 percentage point improvement over MAPTA on the same set) and blind SQL injection success of 66.7% (a 33.3 percentage point improvement).
- AWE trades breadth for efficiency: overall solve rate on XBOW was 51.9% versus MAPTA's 76.9%, but AWE was much faster and cheaper: average solve time 53.1 seconds versus 190.8 seconds for MAPTA, median solve time 35.7 seconds versus 156.2 seconds, and total token consumption 1.12M versus 54.9M (approximately 98% fewer tokens).
- Operational cost and iteration efficiency improved: API cost was reported as $7.73 for AWE versus $21.38 for MAPTA (around 63% lower cost), and the underlying model (Claude Sonnet 4) required fewer payload iterations to converge (10–40 attempts) than the alternatives tested.
- Specialisation yields complementary strengths: AWE excels at context-sensitive, deterministic exploitation that benefits from persistent memory and browser verification, while MAPTA performs better on broad, multi-step or semantic reasoning tasks that require long-horizon planning and sandboxed code execution.
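The headline reductions follow directly from the raw figures; the quick check below merely recomputes them.

```python
# Recompute the reported reductions from the raw figures in the paper.
awe_tokens, mapta_tokens = 1.12e6, 54.9e6
awe_cost, mapta_cost = 7.73, 21.38

token_reduction = 1 - awe_tokens / mapta_tokens   # ~0.98 -> "about 98% fewer"
cost_reduction = 1 - awe_cost / mapta_cost        # ~0.64 -> "around 63% lower"
```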
Limitations
AWE focuses on injection-centric, black-box vulnerabilities and does not target protocol-level, cryptographic or many business-logic flaws. Its agents do not yet coordinate long multi-stage attack chains, and its context and filter models rely on heuristic abstractions that may fail for highly idiosyncratic frameworks. Performance is also sensitive to the chosen LLM; model behaviour and pricing changes could affect effectiveness.
Why It Matters
The paper shows that architectural design, persistent memory and vulnerability-aware pipelines can matter as much as raw LLM capability for practical automated exploitation. AWE’s efficiency and determinism make it attractive for high-frequency or continuous security testing, while demonstrating that specialised agent pipelines can out-perform larger models on targeted tasks. The complementary strengths of specialised and general-purpose systems suggest hybrid designs for broad, dependable autonomous testing. The authors also note security considerations: such tooling can enhance red-team automation but must be evaluated with safety controls and auditability since it could be repurposed for offensive use.