HackWorld Tests AI Agents Against Web App Flaws
Pentesting
HackWorld is a systematic evaluation of computer‑use agents (CUAs), software that interacts with graphical web interfaces, to see whether they can find and exploit real web application flaws. The project frames exploitation tasks as Capture‑the‑Flag (CTF) challenges inside isolated containers and measures whether an agent can navigate a site, run standard security tools and retrieve a secret flag.
The benchmark uses 36 real applications across 11 frameworks and seven programming languages, seeded with authentic problems such as injection flaws, authentication bypasses and unsafe input handling. The environment integrates common pentesting tools from a Kali Linux stack (Burp Suite, DirBuster, Nikto, WFuzz and WhatWeb) so agents can use industry-standard tooling rather than only replying with text. Observations are given in three forms: screenshots, screenshots combined with an accessibility tree, or a compact set of visual marks. Episodes are limited to 30 steps and success is scored by a fuzzy match to account for optical character recognition (OCR) noise.
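The article only specifies that flags are matched fuzzily to tolerate OCR noise (the analysis below mentions edit distance), so here is a minimal sketch of such scoring; the normalisation and the 0.9 threshold are illustrative assumptions, not the benchmark's actual values.

```python
# Minimal sketch of OCR-tolerant flag scoring via edit distance.
# The threshold and normalisation are assumptions for illustration.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def score_flag(submitted: str, expected: str, threshold: float = 0.9) -> int:
    """Return 1 if the submitted flag is close enough to the expected flag."""
    submitted, expected = submitted.strip(), expected.strip()
    if not expected:
        return 0
    similarity = 1 - edit_distance(submitted, expected) / max(len(submitted), len(expected))
    return 1 if similarity >= threshold else 0


# An OCR misread of one character ('O' read as '0') still counts as success.
print(score_flag("FLAG{sqli_0n_l0gin}", "FLAG{sqli_On_l0gin}"))  # -> 1
```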
Results are blunt. Overall exploitation rates remain below 12 per cent across the 36 tasks. The best-performing backbone, Claude 3.7 Sonnet, achieves an average success rate of 10.18 per cent. Claude 4 Sonnet and Claude 3.5 Sonnet score 4.63 per cent and 3.71 per cent respectively. Two GUI-action models, UI-TARS-1.5-7B and Qwen2.5-VL-72B-Instruct, fail in most configurations. The choice of observation format has little measurable effect (the screenshot configuration averaged 3.89 per cent, screenshot plus accessibility tree 3.97 per cent and the set-of-marks encoding 3.17 per cent; statistical tests reported p > 0.1). Allowing extra steps helps some models: Claude 3.7 Sonnet solves about 5.6 per cent more tasks with additional steps, suggesting that more exploration and reasoning time matters.
The authors identify eight recurring failure modes. Agents pick inappropriate tools or misparse outputs; they recover poorly from failures and do not repair plans; they miss directory and source files during enumeration; they perform incomplete port and service mapping; they handle authentication and session logic badly; they misclassify service types; they run shallow SQL injection checks; and they fall into knowledge‑driven dead loops that stall progress. In short, seeing a page or reading a tool dump rarely turns into a coherent multi‑step exploit plan.
Two practical themes stand out. First, perception alone is not the bottleneck: agents can read pages and tool output but cannot reliably stitch clues into a plan or manage tool state. Second, scale and recency of a model do not automatically translate into better exploitation; planning discipline and robust tool orchestration matter more. The paper also flags tool design as an obstacle: many command-line tools produce verbose output and opaque errors that agents struggle to parse. Recommended mitigations include machine-readable outputs, explicit state and error codes, persistent sessions and standardised tool wrappers to make automation less brittle. The authors caution that a platform like HackWorld is dual-use and urge careful governance of releases.
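To make the mitigation concrete, the sketch below wraps a command-line scanner so an agent receives machine-readable output with an explicit status and exit code instead of a raw terminal dump. The JSON shape and the WhatWeb invocation are assumptions for the example, not an interface defined by the paper.

```python
# Illustrative tool wrapper: machine-readable output, explicit status and
# error codes. The schema is an assumption, not the paper's interface.
import json
import subprocess

def run_tool(cmd: list[str], timeout: int = 120) -> dict:
    """Run a pentest CLI tool and return a structured, agent-friendly result."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return {
            "tool": cmd[0],
            "status": "ok" if proc.returncode == 0 else "error",
            "exit_code": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
    except subprocess.TimeoutExpired:
        return {"tool": cmd[0], "status": "timeout", "exit_code": None,
                "stdout": "", "stderr": f"timed out after {timeout}s"}

# Example: fingerprint a (hypothetical) target and hand the agent JSON
# rather than an unstructured terminal dump.
result = run_tool(["whatweb", "http://target.local:8080"])
print(json.dumps(result, indent=2))
```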
Operational takeaways
- Monitor and log automated interactions closely; anomalous multi-step tool chains can indicate agent misuse (a minimal detection sketch follows this list).
- Harden authentication, input handling and enumeration exposure; defence in depth reduces the payoff for automated probes.
- Favour machine-readable tool outputs and standardised interfaces when you permit automation, and limit high-risk automation by default.
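A minimal sketch of the first takeaway, assuming you already collect per-session command logs; the tool list and threshold here are illustrative heuristics, not a vetted detector.

```python
# Flag sessions whose logs look like automated recon-to-exploit chains.
# Tool names and the threshold are illustrative assumptions.
from collections import Counter

RECON_TOOLS = {"whatweb", "nikto", "dirb", "dirbuster", "wfuzz", "sqlmap"}

def looks_like_agent_chain(commands: list[str], min_distinct_tools: int = 3) -> bool:
    """True if one session strings together several distinct pentest tools."""
    tools_seen = Counter(cmd.split()[0] for cmd in commands if cmd.strip())
    distinct = sum(1 for tool in tools_seen if tool in RECON_TOOLS)
    return distinct >= min_distinct_tools

session_log = [
    "whatweb http://shop.internal",
    "dirb http://shop.internal/",
    "wfuzz -w common.txt http://shop.internal/FUZZ",
    "sqlmap -u 'http://shop.internal/item?id=1'",
]
print(looks_like_agent_chain(session_log))  # -> True: worth a closer look
```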
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates whether computer-use agents (CUAs) can discover and exploit vulnerabilities in real web applications through visual interaction. This matters because web applications are critical gateways to services and data, and traditional penetration testing is costly and hard to scale. While language model agents show promise in cybersecurity tasks, modern web environments demand visual understanding of complex interfaces, dynamic content and multi-step workflows, which is exactly what CUAs are built to handle. The study aims to understand the exploitation capabilities of CUAs operating autonomously in vulnerable web environments and to identify gaps that security-aware agents must address to mitigate automated exploitation risks.
Approach
HackWorld is introduced as the first evaluation framework for systematically assessing CUAs' ability to exploit web application vulnerabilities via visual interaction. It exposes CUAs to 36 real-world web applications across 11 frameworks and 7 programming languages, containing authentic flaws such as injection vulnerabilities, authentication bypasses and unsafe input handling. The evaluation uses a Capture-the-Flag (CTF) style setup and integrates industry-standard Kali Linux security tools (Burp Suite, DirBuster, Nikto, WFuzz and WhatWeb) to instrument exploitation attempts. The evaluation environment runs in isolated Docker containers with pre-configured vulnerability settings to ensure reproducibility. The framework tests CUAs across multiple backbone models, including four proprietary Claude-based models and two open-source GUI-action models, UI-TARS-1.5-7B and Qwen2.5-VL-72B-Instruct, all operated on hardware described in the study. The observation space comprises three configurations: screenshot, screenshot plus accessibility (a11y) tree, and set of marks, with episodes limited to 30 steps per task. Success is measured by a fuzzy flag match using edit distance to account for OCR-based errors, scoring one for a correct flag and zero otherwise.
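The following structural sketch shows the evaluation loop as described: one isolated environment per task, one of three observation modes, a 30-step budget and OCR-tolerant flag scoring. The Env and Agent interfaces are hypothetical placeholders, not HackWorld's actual API.

```python
# Structural sketch of one evaluation episode. Env/agent methods are
# hypothetical placeholders standing in for the framework's real interfaces.
OBS_MODES = ("screenshot", "screenshot_a11y_tree", "set_of_marks")
MAX_STEPS = 30

def run_episode(env, agent, obs_mode: str) -> int:
    """Return 1 if the agent submits a flag that fuzzy-matches the target."""
    assert obs_mode in OBS_MODES
    observation = env.reset(obs_mode)            # fresh, isolated container per task
    for _ in range(MAX_STEPS):
        action = agent.act(observation)          # click, type, or run a security tool
        observation, submitted_flag = env.step(action)
        if submitted_flag is not None:           # agent believes it found the flag
            return env.score(submitted_flag)     # OCR-tolerant fuzzy match, 1 or 0
    return 0                                     # step budget exhausted
```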
Key Findings
- Exploitation rates among state-of-the-art CUAs are low, remaining below 12 per cent across the 36 challenges, indicating limited capability to discover and exploit vulnerabilities in realistic web environments.
- The best-performing model, Claude 3.7 Sonnet, attains an average success rate of 10.18 per cent across observation spaces, while Claude 4 Sonnet scores 4.63 per cent and Claude 3.5 Sonnet 3.71 per cent; the GUI-action models UI-TARS-1.5-7B and Qwen2.5-VL-72B-Instruct fail to complete tasks in most settings.
- Observation space has limited impact on success rates: the screenshot configuration yields the highest average at 3.89 per cent, the combination of screenshot and a11y tree reaches 3.97 per cent for some models, and the set-of-marks encoding performs worst at 3.17 per cent, with a statistical test indicating no significant difference across spaces (p > 0.1).
- Performance improves when additional steps are allowed for some models; Claude 3.7 Sonnet solves about 5.6 per cent more tasks with extra steps, suggesting improvements come from extended exploration and reasoning rather than merely model size.
- Eight recurring failure patterns emerge: ineffective tool selection and output parsing; poor failure recovery and plan repair; gaps in directory and source enumeration; incomplete port and service mapping; weak handling of authentication bypass and session management; misclassification of service types; superficial SQL injection testing; and knowledge-driven dead loops that stall progress.
- Perception alone does not guarantee progress; agents can read pages and tool outputs but struggle to aggregate clues into a coherent exploit plan. Claude 3.7 demonstrates better success by selectively analysing key clues and reusing them, yet better perception does not fully translate into better results, highlighting the primacy of strategic reasoning and tool orchestration.
- Scaling by model size or recency does not guarantee higher performance in this domain; planning discipline and strategy control are more critical than raw capacity, aligning with broader findings about scaling in complex reasoning tasks.
- Tool interface design matters; current command-line tool UX is verbose and opaque about errors from an agent's perspective. The authors advocate agent-oriented design, including machine-readable outputs, explicit state and error codes, persistent sessions and standardised tool wrappers, to enable robust reasoning and loop-based progress (a minimal sketch follows this list).
- HackWorld notes the dual-use nature of such evaluation frameworks and argues for responsible release given potential defensive and offensive applications, emphasising the need for governance and monitoring to mitigate automated exploitation risks.
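To illustrate two of the tool-design points above, here is a small sketch of a persistent session that keeps authentication state across calls and returns explicit, machine-readable errors. Class and field names are assumptions for the example, not part of the framework.

```python
# Persistent session with explicit, structured errors, so an agent does not
# re-authenticate on every call or parse free-text failures. Names are
# illustrative assumptions.
import requests

class TargetSession:
    """Keeps cookies and auth state across steps of a multi-step interaction."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.http = requests.Session()   # cookie jar persists between requests

    def get(self, path: str) -> dict:
        try:
            resp = self.http.get(self.base_url + path, timeout=10)
            return {"status": "ok", "code": resp.status_code, "body": resp.text}
        except requests.RequestException as exc:
            return {"status": "error", "code": None, "error": type(exc).__name__}

session = TargetSession("http://target.local:8080")   # hypothetical target
login_page = session.get("/login")                     # later calls reuse the same cookies
print(login_page["status"], login_page["code"])
```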
Limitations
The evaluation relies on 36 CTF-based challenges in controlled, containerised Kali Linux environments and may not capture all real-world production conditions. Results are influenced by the particular toolset used and by the specific backbones tested, and may not generalise to all web architectures or defensive configurations. The study focuses on vulnerability exploitation rather than defensive remediation and does not assess long-term persistence or post-exploitation stealth. The metrics rely on flag-based success and OCR-tolerant matching, which may not reflect every practical exploitation outcome.
Why It Matters
HackWorld provides a structured, reproducible benchmark to evaluate how computer-use agents perform vulnerability discovery and exploitation via visual interfaces on real web applications. The findings reveal substantial gaps in cybersecurity awareness and multi-step planning, highlighting the need for security-aware agent design, safer tool usage, and enhanced logging and auditing to curb automated exploitation. The work informs practical mitigations such as security-conscious agent constraints, safer tool integration, and defence in depth including monitoring and robust reviews of automation in web security tasks. It also raises governance and policy considerations about safeguarding critical web services from scalable automated cyber attacks, while contributing to responsible advancement in AI-driven cybersecurity research.