MUZZLE exposes adaptive prompt injection risks in agents
Agents
Large Language Model (LLM) based web agents are no longer a lab curiosity. They act on behalf of users, click buttons, read forum posts and submit forms. MUZZLE is a tool that turns that capability into an offensive advantage: it observes an agent operating in a sandbox, identifies where untrusted web content is most likely to influence behaviour, and then adapts payloads to hijack the agent.
The framework operates in distinct stages that are worth understanding because they map cleanly to defensive controls. Reconnaissance records a benign run and an adversarial run. A Summarizer compresses the telemetry into an execution trace. A Grafter ranks UI elements by how much control they afford an attacker. Attack Synthesis inserts a placeholder token into chosen surfaces, replays the agent to collect context, and generates candidate malicious instructions. Reflection replays with the modified UI and checks whether confidentiality, integrity or availability goals were achieved. The Zoo component provides a sandbox with deterministic reinitialisation so tests are reproducible.
Why this matters in practice: MUZZLE automatically finds indirect prompt injection vectors that simple templates miss. In experiments it discovers 37 distinct exploit paths across four web applications and ten adversarial objectives. Two novel outcomes stand out. First, a single malicious input in one application can influence actions in another connected service, creating cross‑application attack chains. Second, the system produces agent‑tailored phishing scenarios that read like believable, context aware instructions for the target agent.
Practical risks in plain language
Think of an agent as a person with a browser tab open and a very literal rulebook. Untrusted content that looks like instructions can alter the rulebook. Common high‑salience surfaces include issue replies, post replies and listing responses. Problems arise along five infrastructure axes: model endpoints that process mixed content, shared GPUs and inference backends that expose a single compromised model to many sessions, vector stores that may persist malicious prompts, secrets accessible via agent actions, and data paths that let malicious content flow across services.
Diagram in words: agent session captures UI element text then sends it to the model; model returns action; action touches another service; result leaks secrets or mutates data. That simple chain is enough for confidentiality or integrity failures when the instructions are crafted against the agent's observed behaviour.
Defence is not mystical. Three priorities reduce risk and are practical to implement:
- Sanitise and canonicalise untrusted UI content before the agent sees it; remove or neutralise instruction‑like phrasing and placeholder tokens.
- Enforce isolation between contexts and services; per‑task model sessions, per‑user vector namespaces and strict credential scoping limit cross‑application impact.
- Monitor runtime behaviour against expected plans and roll back or halt if the agent deviates or tries sensitive actions without explicit user confirmation.
Operational checklist and quick wins: add a lightweight content classifier to gate instruction like text, log full execution traces for post‑incident root cause, and require stepwise user confirmation for actions touching critical resources. Where possible, place agents in a sandbox that can be deterministically reinitialised for reproducible testing and integrate adaptive red‑teaming into CI to catch regressions.
MUZZLE is a reminder that adaptive attackers will use the agent's own behaviour against it. The framework is valuable for defenders because it surfaces realistic, long horizon and cross‑service risks, but it also raises the usual dual use flag. For teams under time pressure, focus first on sanitisation, isolation and runtime checks; they buy the most risk reduction for the least engineering lift.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies indirect prompt injection attacks on large language model based web agents that autonomously interact with websites to perform user tasks. The authors note that untrusted web content can embed malicious instructions that override user intent, leading to confidentiality, integrity or availability violations. Existing evaluations rely on fixed templates, manually chosen injection surfaces, or narrow scenarios, which fail to capture realistic adaptive attacks encountered in practice. The work introduces MUZZLE, an automated agentic red teaming framework designed to evaluate web agent security against indirect prompt injection by using the agent's own action traces to locate high salience injection surfaces and to generate context aware malicious instructions that exploit the agent across multiple objectives.
Approach
MUZZLE employs a multi agent architecture to autonomously discover end to end indirect prompt injection attacks in a sandboxed web environment called The Zoo. It prioritises vulnerable user interface elements along the agent's execution trajectory, automatically synthesises adversarial payloads, and adaptively refines its strategy based on execution feedback. The framework is compatible with diverse web applications, agent implementations and underlying LLM backends, supporting reproducible end to end evaluation. MUZZLE operates in three phases: Reconnaissance collects detailed telemetry from benign and adversarial task executions; Summarizer compresses the telemetry into a structured sequence of execution steps; Grafter identifies and ranks highly salient UI elements by exploitation potential; Attack Synthesis selects top ranked surfaces, inserts a placeholder token [INSTR], runs the agent again to obtain contextual information, and uses the Payload Generator to produce candidate malicious instructions which are evaluated by the agent's LLM. The Reflection phase deploys the modified UI and judges whether the attack succeeded, with failures triggering iterative exploration of new paths. The Zoo provides a fully sandboxed backend with deterministic re initialisation, enabling reproducible cross app testing and cross application attack scenarios. The threat model assumes a black box adversary able to observe the victim agent operating in configurable web environments, able to submit malicious content via standard front end interfaces, and with goals including confidentiality, integrity and availability violations. The system relies on the AutoGen library to coordinate multiple LLM based red team agents and uses a BrowserUse style web agent scaffold for the victim.
Key Findings
- MUZZLE discovered 37 distinct indirect prompt injection attacks that violate confidentiality, integrity or availability across four web applications and ten adversarial objectives, under three different victim LLM backends.
- The framework uncovered novel attack strategies including two cross application prompt injections and an agent tailored phishing scenario, highlighting risk vectors beyond single site attacks.
- In the evaluation, the four applications were Gitea, Postmill, Classifieds and Northwind via phpMyAdmin; tasks span repository management, forum participation, e commerce style browsing, and administrative database operations.
- Across applications, some surfaces such as issue replies, post replies and listing responses proved highly effective as injection vessels, with certain actions enabling credential exfiltration, account changes and data manipulation in some runs.
- Model differences were observed: GPT 4 1 generally produced more end to end attacks than GPT 4 o, while Qwen 3 VL 32B instruct displayed patterns similar to GPT 4 1 with limited recovery and frequent completion of adversarial objectives once hijacked; GPT 4 o showed stronger disengagement from adversarial trajectories in some cases.
- Cross app attacks demonstrated that a prompt injection originating in one application could drive destructive actions in a separate interconnected service such as deleting orders in Northwind or altering account details in Postmill, illustrating an expanded risk surface.
- Runtime analysis indicates most overhead derives from LLM inference and web agent execution, with substantial time also spent on environment initialisation, while payload exploration accounts for a smaller but meaningful portion of the total time.
Limitations
The evaluation relies on a sandboxed web environment and a fixed set of web applications and tasks, which may not capture all real world complexities. Results depend on the chosen LLM back ends and agent scaffolds, and while the framework aims to be model and application agnostic, generalisability to unseen architectures should be assessed. The authors acknowledge the dual use nature of red teaming approaches and stress the need for responsible safeguards and governance to prevent misuse.
Why It Matters
MUZZLE provides a practical end to end framework for evaluating and improving defenses against indirect prompt injection in web enabled AI agents. By using agents own execution traces to drive attack generation, it captures realistic long horizon, multi step and cross application attack scenarios that fixed templates miss. The findings point to concrete defensive priorities such as input sanitisation, stronger isolation between applications, robust prompting strategies and runtime monitoring to detect anomalous agent behaviour. The work also expands the threat landscape with cross application IPI attacks and agent tailored phishing, underscoring urgent security considerations for deploying web capable AI agents across heterogeneous online services. The authors highlight the need for responsible use and safeguards to prevent misuse, and suggest MUZZLE as a foundation for evaluating and improving future defenses. This research was supported by Mozilla Corporation.