Agents Leak Secrets via Web Search Tools
Agents
A recent study examines how autonomous AI agents that use web search tools and Retrieval Augmented Generation (RAG) can be manipulated to exfiltrate sensitive data. The authors set up a realistic enterprise workflow: an agent queries external web content, summarises it, then calls an internal knowledge base. By hiding instructions in a hosted blog post, the attacker causes the agent to pull a secret, embed it in a URL, and issue a GET request to an attacker controlled endpoint. The paper tests 1 068 attack instances per model using 89 templates and 12 variations, and finds persistent successful templates across a range of models.
What this means for infrastructure
Think in terms of five plumbing points that matter: model endpoints, GPU hosts, vector stores, secrets, and data paths. The attack chain in the study maps simply to a diagram in words: user input -> external web tool -> RAG lookup against vector store or knowledge base -> model decision to call a tool -> outbound request to attacker. Each hop is a failure surface. A compromised vector store or a permissive tool call policy lets a model turn internal text into exfiltration payloads. Model endpoints and GPU hosts are where that payload is composed and sent, so they need monitoring for unusual output or network activity.
The results show provider differences: some commercial models exhibit stronger resistance because their training deprioritises conflicting tool instructions, while others remain notably vulnerable. Template classes such as role play and certain well known prompt patterns still work even under variation, so relying on clever prompt sanitisation alone is brittle.
Practical run book for onโcall teams
Quick checklist for urgent response and hardening:
- Treat all external tool results as untrusted before they feed RAG or a model.
- Enforce preflight policies that validate any planned tool call and block actions that exfiltrate formatted secrets or emit external network requests.
- Monitor outbound requests from model endpoints and GPU hosts for anomalous destinations or repeated encoded payloads.
Immediate steps: Step 1: Disable autonomous outbound tool calls for any agent that can access sensitive knowledge bases until controls are in place. Step 2: Put a proxy that inspects and sanitises content retrieved from the web before it reaches the model or vector store. Step 3: Add automated checks that detect secret patterns being embedded in outputs or URLs, including common encodings like base64 and zero width tricks used in the study. Step 4: Log and alert on any GET or POST requests originating from model infrastructure to unknown domains. Step 5: Run adversarial tests from the paper toolbox and community resources such as LLMInject as part of CI for model deployments.
Longer term, defence in depth is essential. Hardening training and instruction hierarchies helps, but so do runtime policy layers, guardrails as code, and a shared repository of attack vectors so teams can test before deployment. The paper makes clear that model size alone does not guarantee safety; architecture, training choices and operational controls do most of the defensive work. Treat agent tool integration as a first class security risk and bake continuous adversarial testing into release procedures.
Additional analysis of the original ArXiv paper
๐ Original Paper Title and Abstract
Exploiting Web Search Tools of AI Agents for Data Exfiltration
๐ ShortSpan Analysis of the Paper
Problem
Large language models with tool calling and Retrieval Augmented Generation enable autonomous web searches and access to internal knowledge bases, increasing both usefulness and the risk of data leakage. Indirect prompt injection attacks exploit inputs processed through external tools to exfiltrate sensitive information, presenting a critical and evolving threat as LLMs increasingly interact with external data sources. The paper investigates how vulnerable current LLMs are to such attacks, considering factors such as model size and manufacturer specific safeguards, and which attack methods remain effective. Findings point to persistent weaknesses in model defenses and highlight privacy and data security risks for enterprises and users. The study uses a realistic attack scenario in which an AI agent with web search capabilities and access to an internal knowledge base can be manipulated to leak confidential information, emphasising data privacy and compliance concerns rather than autonomous weapons.
Approach
The authors define a realistic attack scenario and implement an end to end exploit targeting a retrieval augmented generation agent that queries a company knowledge base and retrieves text from external websites. The exploit uses a malicious blog post hosted on a FastAPI server containing obfuscated instructions that are visually hidden but parsed by the agent when it summarises the article. Instructions direct the agent to retrieve a secret from the knowledge base, embed it into a URL, and transmit it to an attacker controlled server via a GET request. The setup mirrors enterprise workflows where agents autonomously retrieve information. The evaluation framework combines 89 attack templates instantiated with 12 variations, producing 1 068 unique attack instances per model, and uses PyRIT to generate diverse prompts, transformations, and encodings such as lengthening, rephrasing, translation, base64 encoding, binary representations, zero width characters, emoji substitutions, random capitalization, and ANSI escape sequences. The authors test a range of open source and proprietary models via a common framework and publish the full experimental setup in an open source repository. A RAG based agent and external web search tool are implemented in their framework; the experiments include a comparison across multiple providers with results summarised in tables and figures. The scenario can form part of multi stage supply chain like attacks and the authors argue for defence in depth as a core principle for safeguarding AI workflows.
Key Findings
- Model vulnerability varies by provider; OpenAI, Google and Amazon models show higher resilience, while X AI models are among the most vulnerable, with Qwen and Meta following in between.
- Increasing template variations raises the total number of successful attacks, but for each model the success rate declines when 12 variations are applied; some templates remain particularly potent across models.
- The twenty most effective templates achieve 5 to 15 percent success across all runs and up to 30 percent when considering only base templates, indicating that certain attack vectors remain effective despite safeguards.
- Some well known templates such as Coach Bobby Knight, Void, BasedGPT and Cody remain effective, and the Based GPT and Role Play templates show robustness to variations, suggesting fundamental vulnerabilities in defense strategies against these structures.
- There is no clear correlation between model size and susceptibility; model capacity alone does not determine security outcomes, implying that architecture and training safeguards are more influential.
- OpenAI models exhibit strong defences attributed to a strict instruction hierarchy in training data, where conflicting instructions for tool calls are deprioritised, underscoring the value of security focused training.
- The work advocates for a centralised database of attack vectors and standardised adversarial testing, drawing on open benchmarks like LLMInject, to cultivate proactive, security by design practices.
- A defence in depth approach is recommended combining runtime monitoring, policy enforcement layers, and guardrails as code to block or mitigate malicious instructions in real time.
- The study provides a reproducible framework for evaluating prompt injection threats and strengthening enterprise AI workflows, calling for standard testing pipelines to keep pace with evolving threats.
- Emissions and resource use were monitored; the experiments used OpenRouter to access models and were estimated to produce less than 50 kg CO2eq in worst case, with editorial use contributing less than 1 kg CO2eq, assuming renewable energy use for about half of operations.
Limitations
The attack scenario is a simplified, binary outcome evaluation that may underestimate the complexity of real world data exfiltration in diverse environments. The evaluation uses a fixed blog based attack and a predefined RAG setup, which may not capture the full spectrum of external tools or enterprise configurations. The 89 templates with 12 variations, yielding 1 068 instances per model, represent a subset of possible adversarial techniques, and results may differ with other configurations or data sources. While provider differences are informative, they reflect specific model implementations and training regimes, limiting the generalisability of exact figures across all systems. The stochastic nature of LLMs means that repeated interactions can yield different outcomes, and the study notes that results are based on single evaluations per variant.
Why It Matters
The work highlights a persistent risk that AI driven enterprise workflows pose privacy and data security challenges when agents interact with external tools and untrusted data. It argues for a multi layer defence strategy that blends stronger resilience training, a central repository of attack vectors to aid proactive defence, and a unified testing framework for ongoing security validation. Practically, organisations should deploy runtime monitoring to detect anomalous exfiltration attempts, enforce policies that validate actions before execution, and embed guardrails as code to enforce security standards within AI workflows. The findings reinforce the case for security by design in LLM development and deployment, moving safety considerations from after thought to a foundational pillar of system architecture. The research also notes efforts such as LLMInject to evaluate defenses and calls for transparent, collaborative frameworks to track, disclose and mitigate vulnerabilities, so that AI technologies can be adopted with appropriate safeguards. In line with privacy concerns, the paper stresses that the societal impact centers on data privacy and corporate compliance rather than military or large scale manipulation concerns.