Specialised LLM agents outgun generalists on web injections
Agents
Web apps now ship at the speed of slide decks, fuelled by no-code builders and AI-assisted development. Security tooling has not kept pace. Signature scanners miss context. Large Language Model (LLM) agents wander, rack up bills, and fail to reproduce results. Into that gap steps AWE, a multi-agent framework that opts for discipline over drift.
AWE packages vulnerability-specific pipelines for classes like cross-site scripting (XSS) and SQL injection, then lets a lightweight LLM act as an adviser for plan selection and payload synthesis. The system carries state through persistent memory, probes targets with browser-backed verification, and aims for evidence before it shouts ‘vulnerability’. The pitch is simple: less exploration theatre, more exploitation proof.
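That division of labour, a deterministic probe loop with the LLM demoted to adviser, can be sketched roughly as follows. All class and function names here are illustrative, not from the paper's implementation; the probe is a stub standing in for a real target.

```python
# Illustrative sketch of a vulnerability-specific pipeline in the AWE style:
# the agent runs a fixed, deterministic probe loop, consults an adviser only
# to rank candidate payloads, and records what it has already tried.
from dataclasses import dataclass, field

@dataclass
class Finding:
    payload: str
    evidence: str  # what the verification step observed

@dataclass
class XSSPipeline:
    tried: set = field(default_factory=set)      # persistent memory across probes
    findings: list = field(default_factory=list)

    def run(self, probe, advise, candidates, budget=20):
        """probe(payload) -> observed reflection; advise(payloads) -> ranked list."""
        for payload in advise(candidates)[:budget]:
            if payload in self.tried:            # never re-spend budget on a probe
                continue
            self.tried.add(payload)
            observed = probe(payload)
            if payload in observed:              # evidence first: verified reflection only
                self.findings.append(Finding(payload, observed))
        return self.findings

# Stub target: reflects input but strips <script> tags.
def probe(p):
    return p.replace("<script>", "")

pipeline = XSSPipeline()
hits = pipeline.run(
    probe,
    advise=lambda ps: sorted(ps, key=len),       # stand-in for the LLM adviser
    candidates=["<script>alert(1)</script>", "<img src=x onerror=alert(1)>"],
)
```

Only the event-handler payload survives the stub's filter, so only it is recorded as a finding; the blocked payload still lands in memory so it is never retried.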
On the 104-challenge XBOW benchmark, that approach pays off for injections. AWE reports 87% success on XSS and 66.7% on blind SQL injection, beating a general-purpose system, MAPTA, by 30.5 and 33.3 percentage points on those categories. It is also quicker and thriftier: average solve time 53.1 seconds versus 190.8, median 35.7 versus 156.2, and about 98% fewer tokens (1.12 million versus 54.9 million). API cost drops to $7.73 from $21.38 in the comparison, despite AWE using a mid-tier model (Claude Sonnet 4) and MAPTA running a more capable one (GPT-5). Architecture, not just model muscle, seems to matter.
There is a catch, and it is not small. Breadth. Overall, AWE solves 51.9% of XBOW where MAPTA lands 76.9%. The specialisation that nails injections narrows horizons elsewhere. The authors are frank about scope: AWE targets black-box injection flaws, not protocol-level, cryptographic or many business-logic bugs, and it does not yet chain long, multi-step attacks. Its context and filtering rely on heuristics that may wobble on idiosyncratic frameworks. Performance is also sensitive to model choice and pricing, which change in the real world without asking your permission.
Does this matter for defenders? If you care about XSS and SQLi in continuous testing, yes. AWE’s design reduces the LLM’s tendency to hallucinate findings by insisting on browser-verified effects and by remembering what it already tried. That produces more reproducible, exploitation-driven results than free-roaming agents, and at a fraction of the time and tokens. It will not replace broader exploratory systems, but it can out-deliver them where the target class is known and common.
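A minimal illustration of why execution-level checks curb hallucinated findings: substring matching flags any reflection, while a browser-style check only counts markup that would actually execute. The encoding behaviour and helper names below are assumptions for the sketch; a real harness would drive a headless browser rather than inspect strings.

```python
import html

# Stub endpoint: reflects input, but HTML-encodes it (a common sanitiser).
def render(payload):
    return f"<p>You searched for: {html.escape(payload)}</p>"

def substring_check(payload):
    # Naive check: "did part of my payload come back?" -> false positives.
    return "alert(1)" in render(payload)

def browser_check(payload):
    # Stand-in for browser-backed verification: encoded markup is inert,
    # so only unescaped <script> or event-handler attributes count.
    page = render(payload)
    return "<script>" in page or "onerror=" in page

payload = "<script>alert(1)</script>"
reflected = substring_check(payload)   # True: the substring came back
executed = browser_check(payload)      # False: the markup was encoded, inert
```

The gap between those two booleans is exactly the gap between a noisy report and a verified exploit.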
What to do with this
- Run specialised, vulnerability-aware agents alongside a generalist. Use the former to cut cost and noise on high-frequency injection checks.
- Pin model versions and log full evidence trails. ‘Deterministic’ means ‘more stable than before’, not ‘immutable’.
- Scope honestly: keep business-logic and long chains with tools that handle multi-step reasoning and sandboxed execution.
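Pinning and evidence logging, as the list above suggests, can be as simple as recording one replayable entry per finding. The record fields here are suggestions rather than any standard, and the version string is a hypothetical pin.

```python
import hashlib, json, time

def evidence_record(model, model_version, target, payload, observed):
    """One replayable evidence entry per finding: pin the model, hash the proof."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model,
        "model_version": model_version,   # pinned explicitly, never 'latest'
        "target": target,
        "payload": payload,
        "observed_sha256": hashlib.sha256(observed.encode()).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)

line = evidence_record(
    "claude-sonnet-4",                    # model family named in the paper
    "2025-05-14",                         # hypothetical version pin
    "https://staging.example/search",
    "<img src=x onerror=alert(1)>",
    "<p><img src=x onerror=alert(1)></p>",
)
```

Hashing the observed response keeps the trail compact while still letting you prove later that a rerun reproduced (or failed to reproduce) the same effect.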
Two open questions remain. First, how well do XBOW and DVWA-style targets track today’s frameworks and peculiar sanitisation stacks? Second, can teams maintain a stable of specialised pipelines as frameworks shift? Even with those caveats, the direction is right: less agent improv, more proof. Not a breakthrough, but a sensible correction.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
AWE: Adaptive Agents for Dynamic Web Penetration Testing
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies autonomous black-box web penetration testing in the context of rapidly evolving web development practices and AI-assisted deployment. Traditional signature-driven scanners struggle with novel sanitisation and application-specific contexts, while recent LLM-driven autonomous testers tend to explore unconstrainedly, producing high cost, unstable behaviour and poor reproducibility. This gap creates a need for systems that can reason about contextual input handling, maintain state across probes and produce verified, exploit-driven findings.
Approach
The authors present AWE, a memory-augmented multi-agent framework that embeds vulnerability-specific analysis pipelines inside a lightweight LLM orchestration layer. AWE has three layers: an Orchestration Layer that maintains a global exploitation context and controls resource budgets; a Specialized Agents Layer with dedicated pipelines for classes such as XSS, SQL injection, SSTI, command injection and IDOR; and a Foundation Layer providing persistent memory, hybrid payload generation, browser-backed verification and endpoint discovery. The LLM acts primarily as an adviser for plan selection and payload synthesis, while agents implement deterministic, evidence-driven procedures. Evaluation used the 104-challenge XBOW benchmark and controlled DVWA experiments, comparing AWE (using Claude Sonnet 4) to a state-of-the-art general-purpose system, MAPTA (using a more capable model and sandbox execution).
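The rough shape of that layered design, a budget-enforcing orchestrator dispatching to per-class agents over shared foundation services, might look like the sketch below. Every identifier is hypothetical; the paper names the layers, not this interface.

```python
# Hypothetical skeleton of the orchestration described above: a global
# context, a hard resource cap, and specialised agents consulted in turn.
class Orchestrator:
    def __init__(self, agents, token_budget):
        self.agents = agents              # e.g. {"xss": ..., "sqli": ..., "ssti": ...}
        self.token_budget = token_budget  # resource budget the layer enforces
        self.context = {}                 # global exploitation context, shared state

    def run(self, target):
        findings = []
        for name, agent in self.agents.items():
            if self.token_budget <= 0:    # hard cap: no unconstrained exploration
                break
            spent, result = agent(target, self.context)
            self.token_budget -= spent
            findings.extend(result)
        return findings

# Toy agents: each returns (tokens spent, findings) for its class.
orch = Orchestrator(
    agents={
        "xss":  lambda t, ctx: (300, [("xss", t)]),
        "sqli": lambda t, ctx: (500, [("sqli", t)]),
        "ssti": lambda t, ctx: (400, []),
    },
    token_budget=700,
)
found = orch.run("https://app.example/login")
```

With a 700-token budget the third agent never runs: the cap is checked before each dispatch, which is the behaviour that keeps cost bounded rather than exploratory.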
Key Findings
- AWE substantially improves success on targeted injection classes: reported XSS success of 87% (a 30.5 percentage point improvement over MAPTA on the same set) and blind SQL injection success of 66.7% (a 33.3 percentage point improvement).
- AWE trades breadth for efficiency: overall solve rate on XBOW was 51.9% versus MAPTA's 76.9%, but AWE was much faster and cheaper: average solve time 53.1 seconds versus 190.8 seconds for MAPTA, median solve time 35.7 seconds versus 156.2 seconds, and total token consumption 1.12M versus 54.9M (approximately 98% fewer tokens).
- Operational cost and iteration efficiency improved: API cost was reported as $7.73 for AWE versus $21.38 for MAPTA (around 63% lower cost), and the underlying model (Claude Sonnet 4) required fewer payload iterations to converge (10–40 attempts) than the alternatives tested.
- Specialisation yields complementary strengths: AWE excels at context-sensitive, deterministic exploitation that benefits from persistent memory and browser verification, while MAPTA performs better on broad, multi-step or semantic reasoning tasks that require long-horizon planning and sandboxed code execution.
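The headline reductions follow directly from the raw figures; the quick check below merely recomputes them.

```python
# Recompute the reported reductions from the raw figures in the paper.
awe_tokens, mapta_tokens = 1.12e6, 54.9e6
awe_cost, mapta_cost = 7.73, 21.38

token_reduction = 1 - awe_tokens / mapta_tokens   # ~0.98 -> "about 98% fewer"
cost_reduction = 1 - awe_cost / mapta_cost        # ~0.64 -> "around 63% lower"
```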
Limitations
AWE focuses on injection-centric, black-box vulnerabilities and does not target protocol-level, cryptographic or many business-logic flaws. Its agents do not yet coordinate long multi-stage attack chains, and its context and filter models rely on heuristic abstractions that may fail for highly idiosyncratic frameworks. Performance is also sensitive to the chosen LLM; model behaviour and pricing changes could affect effectiveness.
Why It Matters
The paper shows that architectural design, persistent memory and vulnerability-aware pipelines can matter as much as raw LLM capability for practical automated exploitation. AWE’s efficiency and determinism make it attractive for high-frequency or continuous security testing, while demonstrating that specialised agent pipelines can out-perform larger models on targeted tasks. The complementary strengths of specialised and general-purpose systems suggest hybrid designs for broad, dependable autonomous testing. The authors also note security considerations: such tooling can enhance red-team automation but must be evaluated with safety controls and auditability since it could be repurposed for offensive use.