HackWorld Tests AI Agents Against Web App Flaws
Pentesting
HackWorld is a systematic evaluation of computer‑use agents (CUAs), software that interacts with graphical web interfaces, to see whether they can find and exploit real web application flaws. The project frames exploitation tasks as Capture‑the‑Flag (CTF) challenges inside isolated containers and measures whether an agent can navigate a site, run standard security tools and retrieve a secret flag.
The benchmark uses 36 real applications across 11 frameworks and seven programming languages, seeded with authentic problems such as injection flaws, authentication bypasses and unsafe input handling. The environment integrates common pentesting tools from a Kali Linux stack (Burp Suite, DirBuster, Nikto, WFuzz and WhatWeb) so agents can use industry-standard tooling rather than only replying with text. Observations are given in three forms: screenshots, screenshots combined with an accessibility tree, or a compact set of visual marks. Episodes are limited to 30 steps and success is scored by a fuzzy match to account for optical character recognition (OCR) noise.
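The article only specifies that flags are matched fuzzily to tolerate OCR noise (the analysis below mentions edit distance), so here is a minimal sketch of such scoring; the normalisation and the 0.9 threshold are illustrative assumptions, not the benchmark's actual values.

```python
# Minimal sketch of OCR-tolerant flag scoring via edit distance.
# The threshold and normalisation are assumptions for illustration.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def score_flag(submitted: str, expected: str, threshold: float = 0.9) -> int:
    """Return 1 if the submitted flag is close enough to the expected flag."""
    submitted, expected = submitted.strip(), expected.strip()
    if not expected:
        return 0
    similarity = 1 - edit_distance(submitted, expected) / max(len(submitted), len(expected))
    return 1 if similarity >= threshold else 0


# An OCR misread of one character ('O' read as '0') still counts as success.
print(score_flag("FLAG{sqli_0n_l0gin}", "FLAG{sqli_On_l0gin}"))  # -> 1
```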
Results are blunt. Overall exploitation rates remain below 12 per cent across the 36 tasks. The best-performing backbone, Claude 3.7 Sonnet, achieves an average success rate of 10.18 per cent. Claude 4 Sonnet and Claude 3.5 Sonnet score 4.63 per cent and 3.71 per cent respectively. Two GUI-action models, UI-TARS-1.5-7B and Qwen2.5-VL-72B-Instruct, fail in most configurations. The choice of observation format has little measurable effect (the screenshot configuration averaged 3.89 per cent, screenshot plus accessibility tree 3.97 per cent and the set-of-marks encoding 3.17 per cent; statistical tests reported p > 0.1). Allowing extra steps helps some models: Claude 3.7 Sonnet solves about 5.6 per cent more tasks with additional steps, suggesting that more exploration and reasoning time matters.
The authors identify eight recurring failure modes. Agents pick inappropriate tools or misparse outputs; they recover poorly from failures and do not repair plans; they miss directory and source files during enumeration; they perform incomplete port and service mapping; they handle authentication and session logic badly; they misclassify service types; they run shallow SQL injection checks; and they fall into knowledge‑driven dead loops that stall progress. In short, seeing a page or reading a tool dump rarely turns into a coherent multi‑step exploit plan.
Two practical themes stand out. First, perception alone is not the bottleneck: agents can read pages and tool output but cannot reliably stitch clues into a plan or manage tool state. Second, scale and recency of a model do not automatically translate into better exploitation; planning discipline and robust tool orchestration matter more. The paper also flags tool design as an obstacle: many command-line tools produce verbose output and opaque errors that agents struggle to parse. Recommended mitigations include machine-readable outputs, explicit state and error codes, persistent sessions and standardised tool wrappers to make automation less brittle. The authors caution that a platform like HackWorld is dual-use and urge careful governance of releases.
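To make the mitigation concrete, the sketch below wraps a command-line scanner so an agent receives machine-readable output with an explicit status and exit code instead of a raw terminal dump. The JSON shape and the WhatWeb invocation are assumptions for the example, not an interface defined by the paper.

```python
# Illustrative tool wrapper: machine-readable output, explicit status and
# error codes. The schema is an assumption, not the paper's interface.
import json
import subprocess

def run_tool(cmd: list[str], timeout: int = 120) -> dict:
    """Run a pentest CLI tool and return a structured, agent-friendly result."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return {
            "tool": cmd[0],
            "status": "ok" if proc.returncode == 0 else "error",
            "exit_code": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
    except subprocess.TimeoutExpired:
        return {"tool": cmd[0], "status": "timeout", "exit_code": None,
                "stdout": "", "stderr": f"timed out after {timeout}s"}

# Example: fingerprint a (hypothetical) target and hand the agent JSON
# rather than an unstructured terminal dump.
result = run_tool(["whatweb", "http://target.local:8080"])
print(json.dumps(result, indent=2))
```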
Operational takeaways
- Monitor and log automated interactions closely; anomalous multi-step tool chains can indicate agent misuse (a minimal detection sketch follows this list).
- Harden authentication, input handling and enumeration exposure; defence in depth reduces the payoff for automated probes.
- Favour machine-readable tool outputs and standardised interfaces when you permit automation, and limit high-risk automation by default.
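A minimal sketch of the first takeaway, assuming you already collect per-session command logs; the tool list and threshold here are illustrative heuristics, not a vetted detector.

```python
# Flag sessions whose logs look like automated recon-to-exploit chains.
# Tool names and the threshold are illustrative assumptions.
from collections import Counter

RECON_TOOLS = {"whatweb", "nikto", "dirb", "dirbuster", "wfuzz", "sqlmap"}

def looks_like_agent_chain(commands: list[str], min_distinct_tools: int = 3) -> bool:
    """True if one session strings together several distinct pentest tools."""
    tools_seen = Counter(cmd.split()[0] for cmd in commands if cmd.strip())
    distinct = sum(1 for tool in tools_seen if tool in RECON_TOOLS)
    return distinct >= min_distinct_tools

session_log = [
    "whatweb http://shop.internal",
    "dirb http://shop.internal/",
    "wfuzz -w common.txt http://shop.internal/FUZZ",
    "sqlmap -u 'http://shop.internal/item?id=1'",
]
print(looks_like_agent_chain(session_log))  # -> True: worth a closer look
```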
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates whether computer-use agents (CUAs) can discover and exploit vulnerabilities in real web applications through visual interaction. This matters because web applications are critical gateways to services and data, and traditional penetration testing is costly and hard to scale. While language model agents show promise in cybersecurity tasks, modern web environments demand visual understanding of complex interfaces, dynamic content and multi-step workflows, which is exactly what CUAs are built to handle. The study aims to understand the exploitation capabilities of CUAs operating autonomously in vulnerable web environments and to identify gaps that security-aware agents must address to mitigate automated exploitation risks.
Approach
HackWorld is introduced as the first evaluation framework for systematically assessing CUAs' ability to exploit web application vulnerabilities via visual interaction. It exposes CUAs to 36 real-world web applications across 11 frameworks and 7 programming languages, containing authentic flaws such as injection vulnerabilities, authentication bypasses and unsafe input handling. The evaluation uses a Capture-the-Flag (CTF) style setup and integrates industry-standard Kali Linux security tools (Burp Suite, DirBuster, Nikto, WFuzz and WhatWeb) to instrument exploitation attempts. The evaluation environment runs in isolated Docker containers with pre-configured vulnerability settings to ensure reproducibility. The framework tests CUAs across multiple backbone models, including four proprietary Claude-based models and two open-source GUI-action models, UI-TARS-1.5-7B and Qwen2.5-VL-72B-Instruct, all operated on hardware described in the study. The observation space comprises three configurations: screenshot, screenshot plus accessibility (a11y) tree, and set of marks, with episodes limited to 30 steps per task. Success is measured by a fuzzy flag match using edit distance to account for OCR-based errors, scoring one for a correct flag and zero otherwise.
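The following structural sketch shows the evaluation loop as described: one isolated environment per task, one of three observation modes, a 30-step budget and OCR-tolerant flag scoring. The Env and Agent interfaces are hypothetical placeholders, not HackWorld's actual API.

```python
# Structural sketch of one evaluation episode. Env/agent methods are
# hypothetical placeholders standing in for the framework's real interfaces.
OBS_MODES = ("screenshot", "screenshot_a11y_tree", "set_of_marks")
MAX_STEPS = 30

def run_episode(env, agent, obs_mode: str) -> int:
    """Return 1 if the agent submits a flag that fuzzy-matches the target."""
    assert obs_mode in OBS_MODES
    observation = env.reset(obs_mode)            # fresh, isolated container per task
    for _ in range(MAX_STEPS):
        action = agent.act(observation)          # click, type, or run a security tool
        observation, submitted_flag = env.step(action)
        if submitted_flag is not None:           # agent believes it found the flag
            return env.score(submitted_flag)     # OCR-tolerant fuzzy match, 1 or 0
    return 0                                     # step budget exhausted
```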
Key Findings
- Exploitation rates among state-of-the-art CUAs are low, remaining below 12 per cent across the 36 challenges, indicating limited capability to discover and exploit vulnerabilities in realistic web environments.
- The best-performing model, Claude 3.7 Sonnet, attains an average success rate of 10.18 per cent across observation spaces, while Claude 4 Sonnet scores 4.63 per cent and Claude 3.5 Sonnet 3.71 per cent; the GUI-action models UI-TARS-1.5-7B and Qwen2.5-VL-72B-Instruct fail to complete tasks in most settings.
- Observation space has limited impact on success rates: the screenshot configuration yields the highest average at 3.89 per cent, the combination of screenshot and a11y tree reaches 3.97 per cent for some models, and the set-of-marks encoding performs worst at 3.17 per cent, with a statistical test indicating no significant difference across spaces (p > 0.1).
- Performance improves when additional steps are allowed for some models; Claude 3.7 Sonnet solves about 5.6 per cent more tasks with extra steps, suggesting improvements come from extended exploration and reasoning rather than merely model size.
- Eight recurring failure patterns emerge: ineffective tool selection and output parsing; poor failure recovery and plan repair; gaps in directory and source enumeration; incomplete port and service mapping; weak handling of authentication bypass and session management; misclassification of service types; superficial SQL injection testing; and knowledge-driven dead loops that stall progress.
- Perception alone does not guarantee progress; agents can read pages and tool outputs but struggle to aggregate clues into a coherent exploit plan. Claude 3.7 demonstrates better success by selectively analysing key clues and reusing them, yet better perception does not fully translate into better results, highlighting the primacy of strategic reasoning and tool orchestration.
- Scaling by model size or recency does not guarantee higher performance in this domain; planning discipline and strategy control are more critical than raw capacity, aligning with broader findings about scaling in complex reasoning tasks.
- Tool interface design matters; current command-line tool UX is verbose and opaque about errors from an agent's perspective. The authors advocate agent-oriented design, including machine-readable outputs, explicit state and error codes, persistent sessions and standardised tool wrappers, to enable robust reasoning and loop-based progress (a minimal sketch follows this list).
- HackWorld notes the dual-use nature of such evaluation frameworks and argues for responsible release given potential defensive and offensive applications, emphasising the need for governance and monitoring to mitigate automated exploitation risks.
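To illustrate two of the tool-design points above, here is a small sketch of a persistent session that keeps authentication state across calls and returns explicit, machine-readable errors. Class and field names are assumptions for the example, not part of the framework.

```python
# Persistent session with explicit, structured errors, so an agent does not
# re-authenticate on every call or parse free-text failures. Names are
# illustrative assumptions.
import requests

class TargetSession:
    """Keeps cookies and auth state across steps of a multi-step interaction."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.http = requests.Session()   # cookie jar persists between requests

    def get(self, path: str) -> dict:
        try:
            resp = self.http.get(self.base_url + path, timeout=10)
            return {"status": "ok", "code": resp.status_code, "body": resp.text}
        except requests.RequestException as exc:
            return {"status": "error", "code": None, "error": type(exc).__name__}

session = TargetSession("http://target.local:8080")   # hypothetical target
login_page = session.get("/login")                     # later calls reuse the same cookies
print(login_page["status"], login_page["code"])
```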
Limitations
The evaluation relies on 36 CTF-based challenges in controlled, containerised Kali Linux environments and may not capture all real-world production conditions. Results are influenced by the particular toolset used and by the specific backbones tested, and may not generalise to all web architectures or defensive configurations. The study focuses on vulnerability exploitation rather than defensive remediation and does not assess long-term persistence or post-exploitation stealth. The metrics rely on flag-based success and OCR-tolerant matching, which may not reflect every practical exploitation outcome.
Why It Matters
HackWorld provides a structured, reproducible benchmark to evaluate how computer-use agents perform vulnerability discovery and exploitation via visual interfaces on real web applications. The findings reveal substantial gaps in cybersecurity awareness and multi-step planning, highlighting the need for security-aware agent design, safer tool usage, and enhanced logging and auditing to curb automated exploitation. The work informs practical mitigations such as security-conscious agent constraints, safer tool integration, and defence in depth including monitoring and robust reviews of automation in web security tasks. It also raises governance and policy considerations about safeguarding critical web services from scalable automated cyber attacks, while contributing to responsible advancement in AI-driven cybersecurity research.