MCP tool poisoning steers LLM agents at scale

Pentesting

Published: Fri, Sep 26, 2025 • By Lydia Stratus

MCP tool poisoning steers LLM agents at scale

This paper shows that Model Context Protocol (MCP) tools can be poisoned to steer Large Language Model (LLM) agents. An automated framework, AutoMalTool, generates malicious tools with about 85% generation success and roughly 35% real-agent effectiveness while evading detectors. The finding exposes a scalable attack surface and a gap in current defences.

The paper examines a practical and underappreciated risk: metadata for Model Context Protocol (MCP) tools can be tampered with to change how Large Language Model (LLM) based agents behave. MCP tools standardise how agents call external capabilities. The researchers build an automated red team, AutoMalTool, to generate malicious MCP tool descriptions and test whether those altered descriptions can steer agent decisions.

What the research shows

AutoMalTool composes malicious tool packages by editing only metadata such as name, description and input schema. The framework runs four specialised roles: an initial generator for realistic tasks, an oracle that simulates detection, an evaluator that measures behavioural effect, and an optimiser that refines the tool text. The team reports an average generation success of about 85% across tested servers and an overall effective success rate near 35% when attacking real agents. Generated tools also evade current detectors: roughly 11.1% evasion against a tool scanner the authors call MCP Scan and about 23.4% against another detector labelled AIG.

Attacks fall into two straightforward categories. First, incorrect parameter invocation means the tool causes an agent to pass wrong or misleading inputs. Second, output misinterpretation means the tool returns results that the agent misreads or trusts when it should not. The study finds variance by target: some MCP servers and agents are more robust than others. Cline resists manipulation better than Claude Desktop in the paper's tests, and the authors note differences across server types such as Alpaca, WhatsApp and a simple Filesystem server.

Put simply: standard tooling interfaces are a scalable attack surface. The red team here is automated, cheap and fast. The framework produces packages in minutes and uses only metadata edits, which makes detection and manual review harder. The implication is not just academic. In production, a poisoned tool could nudge an automation flow, leak context, or silently change decisions in finance, healthcare or operations.

Practical mitigations and a quick diagram

Diagram-in-words: Agent -> MCP registry -> Tool metadata -> Tool process (sandboxed) -> External resource. The weak link is the metadata and the decision logic that trusts it. Treat metadata as untrusted input.

Mitigations are straightforward, familiar and must be enforced as architecture rules rather than optional hygiene. Key controls are cryptographic signing and provenance for tool packages, strict whitelisting and policy gates, capability bounding and sandboxing, and runtime monitoring of tool selection and invocation patterns. Agents should verify that a tool's declared inputs and outputs match a signed schema before use, and they should fail closed if a tool's behaviour diverges at runtime.

Sign and vet all MCP tools before registry entry.
Run tools in capability-limited sandboxes and enforce strict schemas at call time.
Monitor invocation patterns and drop to a safe state on anomalies.

Operational steps: maintain an auditable registry, enforce automated pre-deployment checks, instrument agent-to-tool channels for detection, and require a rollback path if tools misbehave. Do not rely solely on current detectors; this paper shows evasions are practical. If you operate LLM agents, assume tool metadata is suspect and harden the plumbing now rather than after an incident forces the fix.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Automatic Red Teaming LLM-based Agents with Model Context Protocol Tools

Authors: Ping He, Changjiang Li, Binbin Zhao, Tianyu Du, and Shouling Ji

The remarkable capability of large language models (LLMs) has led to the wide application of LLM-based agents in various domains. To standardize interactions between LLM-based agents and their environments, model context protocol (MCP) tools have become the de facto standard and are now widely integrated into these agents. However, the incorporation of MCP tools introduces the risk of tool poisoning attacks, which can manipulate the behavior of LLM-based agents. Although previous studies have identified such vulnerabilities, their red teaming approaches have largely remained at the proof-of-concept stage, leaving the automatic and systematic red teaming of LLM-based agents under the MCP tool poisoning paradigm an open question. To bridge this gap, we propose AutoMalTool, an automated red teaming framework for LLM-based agents by generating malicious MCP tools. Our extensive evaluation shows that AutoMalTool effectively generates malicious MCP tools capable of manipulating the behavior of mainstream LLM-based agents while evading current detection mechanisms, thereby revealing new security risks in these agents.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies the security risks arising from model context protocol MCP tools used by large language model based agents. MCP tools standardise interactions between agents and external resources, but their metadata can be poisoned to influence agent behaviour. The work investigates automatic red teaming of LLM based agents under the MCP tool poisoning paradigm, introducing an automated framework to generate malicious MCP tools and examining their ability to manipulate agent actions while evading current detectors. It highlights a scalable attack surface in AI tool ecosystems and the gap in existing defenses against automated tool manipulation.

Approach

The authors present AutoMalTool, a multi agent red teaming framework that creates MCP server packages containing malicious MCP tools. It comprises four specialised agents: Initial Generator, Oracle, Effectiveness Evaluator, and Tool Optimizer. The Initial Generator proposes realistic user tasks for a benign MCP tool and defines malicious behaviours in two categories, incorrect parameter invocation and output results mis interpretation. The Oracle assesses evasiveness by simulating detection, the Effectiveness Evaluator tests whether the malicious tool achieves the intended effect in a simulated environment, and the Tool Optimizer refines the tool description based on Oracle and Evaluator feedback. A static code analysis based modifier updates the MCP server packages while processing only metadata (tool name, description, input schema). The evaluation uses two LLM based agents, Claude Desktop and Cline, three MCP servers (Alpaca, WhatsApp, Filesystem) and 53 MCP tools, testing detection evasion and attack effectiveness. The framework relies on a simulated agent environment with chain of thought prompting to model tool usage stages: tool selection, parameter grounding, and result interpretation. An LLM acts as a behaviour judger to judge whether the malicious description induces the intended effects. Prompts and prompts based prompts are used to guide generation and evaluation.

Key Findings

Generation success rate: AutoMalTool achieves an average generation success rate of about 85.0 percent across tested MCP servers and tool tasks, with an effective success rate of around 35.3 percent for real agents.
Evasiveness against detectors: the generated malicious MCP tools show strong evasion, with evasion rates around 11.1 percent against MCP Scan and 23.4 percent against A I G, indicating current detectors are largely ineffective against these attacks.
Attack variety and robustness: attacks include incorrect parameter invocation and output results mis interpretation. The success varies by target server and agent; some cases reach ESR above 70 percent, while others are lower. Cline generally exhibits greater robustness than Claude Desktop, and Claude Opus 4 1 displays higher resilience overall; GPT 4 1 shows stronger resistance than GPT 5 chat latest in some scenarios.
Operational efficiency: generation costs are economical and fast; the average time to generate a malicious MCP tool is about 200 seconds, with roughly 2 optimization iterations per tool, and an average of about 15 000 tokens used per tool at a cost near 0 03 US dollars.
Use of special tokens: the framework can insert Unicode tokens such as bullet symbols to enhance attack effectiveness, suggesting subtle textual cues can aid manipulation.
Impact across servers: Alpaca and WhatsApp MCP servers yield higher ESR than the Filesystem server, likely due to greater tool complexity and domain mismatch with training data.

Limitations

The study focuses on two malicious behaviours and does not exhaust other potential threats such as exploitation of MCP tool and agent integration, data exfiltration, or shadowing of tools. It evaluates only offline red teaming with simulated agents and does not address live deployment risks. The authors acknowledge ethical concerns and state they will restrict code sharing to verified researchers and will not upload harmful MCP servers to public repositories. There is a possibility that stronger attackers could adapt tools to defeat defenses beyond current detectors.

Why It Matters

The work underscores that MCP tool ecosystems constitute a tangible and scalable attack surface for AI agents. The ability to automatically generate malicious tools that evade detection and steer agent behaviour has practical security implications for critical domains. It calls for stronger tool governance including cryptographic signing, provenance tracking, strict whitelisting, sandboxing, runtime monitoring, and robust policy enforcement. The findings motivate the need for architecture hardening with auditable channels between agents and external tools and for adversarial training to improve agent resilience. The societal risk is that compromised agents could influence decisions in finance, healthcare or public safety, emphasising the importance of secure MCP tool provenance and governance in AI workflows.

Attribution Original paper on arXiv