MalTool Shows LLM Agents Can Run Hidden Malware
Agents
AI agents that extend themselves with third party tools are useful. They are also a new attack surface. A paper called MalTool examines what happens when an attacker hides malicious behaviour inside tool code rather than relying on deceptive names or descriptions. The study is targeted, methodical and, at times, quietly alarming.
How MalTool works
The authors focus on coding Large Language Models, a class of LLMs tuned to produce source code. They build a framework that instructs a coding LLM to generate tools exhibiting specified malicious behaviours mapped to the confidentiality, integrity and availability triad. An automated verifier then checks whether the generated tool actually performs the targeted malicious action and whether its structure differs sufficiently from earlier samples. The verifier feeds back to the generator until the tool passes.
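The pipeline is easiest to picture as a generate-and-verify loop. The sketch below is a minimal illustration of that control flow under our own assumptions, not the authors' code: generate_tool, behaviour_check and diversity_check are hypothetical placeholders for the coding LLM call and the two verifier stages, and no attack logic appears here.

```python
# Minimal sketch of MalTool's generate-and-verify control flow (illustration only).
# generate_tool, behaviour_check and diversity_check are hypothetical placeholders
# for the coding LLM call and the paper's two verifier stages; this is not the
# authors' implementation and contains no attack logic.

from typing import Callable, List, Optional


def synthesise_tool(
    behaviour_spec: str,
    prior_samples: List[str],
    generate_tool: Callable[[str, str], str],           # (spec, feedback) -> candidate source
    behaviour_check: Callable[[str, str], bool],        # does the candidate realise the spec?
    diversity_check: Callable[[str, List[str]], bool],  # is it structurally novel enough?
    max_iterations: int = 10,
) -> Optional[str]:
    """Iterate generation until a candidate passes both verifier checks."""
    feedback = ""
    for _ in range(max_iterations):
        candidate = generate_tool(behaviour_spec, feedback)

        if not behaviour_check(candidate, behaviour_spec):
            # Functional check failed: regenerate with corrective feedback.
            feedback = "candidate did not realise the target behaviour"
            continue

        if not diversity_check(candidate, prior_samples):
            # Structural similarity check failed: push the generator towards novelty.
            feedback = "candidate is too similar to earlier samples"
            continue

        prior_samples.append(candidate)
        return candidate  # passes both checks

    return None  # give up after max_iterations
```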
Using this pipeline the researchers produce two malicious tool corpora. Dataset I contains 1,200 standalone malicious tools. Dataset II contains 5,287 real world tools with embedded malicious functions, essentially trojanised versions of benign projects. The paper reports an attack success rate of 1.0 across behaviours and multiple coding LLMs when the verifier is used, and shows that even safety aligned models can produce working malicious code, although efficiency varies by model.
The team evaluates existing detection approaches, including the VirusTotal scanning service, which aggregates many commercial antivirus engines, and several LLM based detectors. Detection performance is uneven. VirusTotal shows high false negative rates, and some targeted behaviours, notably resource hijacking and denial of service, are particularly hard to detect. Benign tools can trigger false positives in some detectors, which complicates practical deployment of automated checks.
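Comparing detectors against these corpora comes down to false negative rates on the malicious datasets and false positive rates on the benign one. A minimal sketch of that bookkeeping follows, assuming a hypothetical scan callable that stands in for any detector and returns True when it flags a tool.

```python
# Sketch of the false negative / false positive bookkeeping used to compare detectors.
# `scan` is a hypothetical callable standing in for any detector (a VirusTotal lookup,
# an LLM based scanner, and so on); it returns True when it flags a tool as malicious.

from typing import Callable, Iterable


def false_negative_rate(malicious_tools: Iterable[str],
                        scan: Callable[[str], bool]) -> float:
    """Share of known malicious tools the detector fails to flag."""
    tools = list(malicious_tools)
    missed = sum(1 for source in tools if not scan(source))
    return missed / len(tools)


def false_positive_rate(benign_tools: Iterable[str],
                        scan: Callable[[str], bool]) -> float:
    """Share of benign tools the detector wrongly flags."""
    tools = list(benign_tools)
    flagged = sum(1 for source in tools if scan(source))
    return flagged / len(tools)
```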
What the paper misses or understates
The study is careful about its scope. It runs on locally controlled infrastructure, uses synthetic attack instantiations and focuses on code level analysis rather than end to end system protections. That matters. A real attacker will chain stages, exploit environmental assumptions, and test evasions beyond what a verifier checks. The authors note these limitations, and they also caution that their benign dataset may itself contain undisclosed malicious instances that could affect false positive rates.
Still, the central finding stands: coding LLMs can be leveraged to produce operational malicious tools at scale and common detectors struggle to catch them. That combination makes the risk realistic rather than theoretical. The paper adds value by releasing large benign tool datasets for benchmarking, which ought to help defenders measure detection at scale.
The research supports a narrow set of practical responses. Code level vetting at distribution points, stronger runtime monitoring and tighter isolation when agents run third party tools are reasonable mitigations that the authors advocate. The work does not offer a silver bullet; it shows where defenders need to work harder and where current tooling fails.
For security teams the paper is a prompt to treat agent toolchains like any other supply chain: assume code can be malicious, validate behaviour beyond names and descriptions, and build layered controls rather than trusting any single detector or model alignment claim.
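To make the layering concrete, one plausible shape for a control is a pre-execution gate: statically vet the tool's source against a policy, then run it in a separate, time limited interpreter. The sketch below is illustrative only; the import policy and subprocess isolation are our assumptions rather than the paper's detailed recommendations, and a real deployment would add OS level isolation and runtime monitoring.

```python
# Illustrative layered gate for third party agent tools: static vetting first, then
# execution in a separate, time limited interpreter. A sketch of the layering idea,
# not a hardened sandbox; real deployments would add OS level isolation (containers,
# seccomp, network policy) and runtime monitoring on top.

import ast
import subprocess
import sys

# Modules a simple static policy might refuse in unvetted tool code (illustrative list).
DISALLOWED_IMPORTS = {"socket", "ctypes", "subprocess"}


def static_vet(tool_source: str) -> bool:
    """Reject tool code whose imports fall outside the allowed policy."""
    try:
        tree = ast.parse(tool_source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom):
            names = {(node.module or "").split(".")[0]}
        else:
            continue
        if names & DISALLOWED_IMPORTS:
            return False
    return True


def run_isolated(tool_path: str, timeout_s: int = 10) -> str:
    """Run the tool in a fresh interpreter (-I: isolated mode) with a hard timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-I", tool_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "terminated: exceeded time limit"
    return result.stdout


def gated_execution(tool_path: str) -> str:
    """Layer one: static vetting. Layer two: isolated, time limited execution."""
    with open(tool_path, "r", encoding="utf-8") as handle:
        source = handle.read()
    if not static_vet(source):
        return "blocked by static vetting"
    return run_isolated(tool_path)
```

The gate is only one layer; the paper's point is precisely that no single layer, including a static check like this, catches everything.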
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
MalTool: Malicious Tool Attacks on LLM Agents
🔍 ShortSpan Analysis of the Paper
Problem
MalTool studies code level malicious tool attacks on LLM agents, addressing a gap where attackers embed harmful behaviours directly in tool code rather than relying solely on deceitful tool names or descriptions. In such attacks an adversary uploads a tool to a distribution platform; when a user installs it and an LLM agent selects and executes the tool during task completion, user security and privacy can be compromised. The work proposes a CIA triad based taxonomy of malicious tool behaviours tailored to LLM agent settings and investigates the risk of attackers using coding LLMs to automatically generate malicious tools that function either as standalone utilities or as payloads embedded within benign implementations.
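Read this way, the taxonomy is a mapping from concrete behaviours to the CIA property they primarily violate. The sketch below is illustrative: only resource hijacking and denial of service are named in this summary, so the other entries are generic examples rather than the paper's twelve behaviour list.

```python
# Illustrative CIA triad mapping for malicious tool behaviours. The paper defines
# twelve behaviours; only resource hijacking and denial of service are named in this
# summary, so the remaining entries are generic examples, not the paper's taxonomy.

from enum import Enum


class CIAProperty(Enum):
    CONFIDENTIALITY = "confidentiality"
    INTEGRITY = "integrity"
    AVAILABILITY = "availability"


BEHAVIOUR_TAXONOMY = {
    "data_exfiltration": CIAProperty.CONFIDENTIALITY,  # generic example
    "credential_theft": CIAProperty.CONFIDENTIALITY,   # generic example
    "file_tampering": CIAProperty.INTEGRITY,           # generic example
    "output_manipulation": CIAProperty.INTEGRITY,      # generic example
    "resource_hijacking": CIAProperty.AVAILABILITY,    # named in the summary
    "denial_of_service": CIAProperty.AVAILABILITY,     # named in the summary
}


def triad_category(behaviour: str) -> CIAProperty:
    """Look up which CIA property a behaviour primarily violates."""
    return BEHAVIOUR_TAXONOMY[behaviour]
```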
Approach
The authors introduce MalTool, a coding LLM based framework that synthesises tools exhibiting specified malicious behaviours and uses an automated verifier to ensure functional correctness and structural diversity. The verifier tests whether a generated tool realises the target behaviour and remains sufficiently different from prior instances, iterating until success. System prompts guide the coding LLM to produce diverse tools and incorporate verifier feedback to improve efficiency. Two datasets are produced: Dataset I with 1,200 standalone malicious tools and Dataset II with 5,287 real world tools containing embedded malicious behaviours, drawn from real tool collections. Trojan tools are created by embedding a malicious function into benign tools. A large benign real world tool dataset, Dataset III, is also collected to assess false positives. The study evaluates detection methods including VirusTotal and LLM based detectors such as Tencent AIG, Cisco MCP Scanner, AntGroup MCPScan, and a Combined Scanner.
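The structural diversity requirement can be approximated with a similarity measure over code tokens, compared against previously accepted samples and gated by a threshold. The sketch below uses token set Jaccard similarity as a plausible stand in; the paper tunes a similarity threshold, but this summary does not specify the exact measure, so both the measure and the threshold here are assumptions.

```python
# Plausible stand-in for a structural diversity verifier: compare a candidate's token
# set against previously accepted samples with Jaccard similarity and reject it when
# it is too close to any of them. The exact measure and threshold used in the paper
# are not specified in this summary, so both are assumptions here.

import io
import tokenize
from typing import List, Set


def token_set(source: str) -> Set[str]:
    """Lexical tokens of a Python source string (comments and layout ignored)."""
    tokens = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING):
            tokens.add(tok.string)
    return tokens


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity of two token sets, in [0, 1]."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def is_sufficiently_diverse(candidate: str,
                            prior_samples: List[str],
                            threshold: float = 0.7) -> bool:
    """Accept the candidate only if it stays below the threshold against every prior sample."""
    candidate_tokens = token_set(candidate)
    return all(jaccard(candidate_tokens, token_set(previous)) < threshold
               for previous in prior_samples)
```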
Key Findings
- MalTool achieves an attack success rate of 1.0 across all malicious behaviours and across multiple coding LLMs, including safety aligned open weight models and closed source models, when the verifier is employed.
- Without the verifier, attack success rates vary with the model and behaviour, demonstrating the importance of automated correctness and diversity verification for scalable tool generation.
- MalTool produces diverse implementations; the diversity verifier reduces structural similarity and lowers the number of generation verification iterations, with the similarity threshold tuned to balance diversity and efficiency.
- Dataset I contains 1,200 standalone malicious tools and Dataset II contains 5,287 Trojan tools embedded into 5,287 benign real world tools, distributed across 12 malicious behaviours, ensuring broad coverage of potential threats in real world tool ecosystems.
- Detection methods struggle to reliably identify malicious tools. VirusTotal shows high false negative rates; other detectors exhibit mixed performance, with Resource Hijacking and Denial of Service behaviours particularly hard to detect. Benign real world tools also trigger false positives with some detectors, impairing their practical deployment.
- Even safety aligned coding LLMs can produce malicious tools with the specified behaviours, though generation efficiency varies. Closed source models show low monetary costs per successful tool generation, highlighting practical feasibility of such attacks.
- The authors provide public benign real world tool datasets for defence benchmarking and advocate for code level vetting, runtime monitoring and governance around tool distribution and execution as critical defences.
Limitations
The study relies on synthetic data and locally controlled infrastructure for attack instantiations, and does not exhaust all possible long horizon or multi stage attack scenarios. Dataset III focuses on real world tools which may include some malicious instances, and the evaluation considers program analysis based detectors rather than end to end system protections. The authors acknowledge limitations in generalising findings to all real world deployments and note that some results may reflect detector false positives on benign tools.
Why It Matters
The work demonstrates that malicious behaviour can be hidden inside tools used by AI agents, not just within their names or descriptions, creating a realistic risk when tools are automatically selected and executed. MalTool enables scalable threat modelling and red team style testing of AI driven workflows by automatically generating and verifying malicious tool implementations. The release of large benign tool datasets provides benchmarks for defending against such attacks while emphasising the need for new defences including code level vetting, runtime monitoring, isolation, and governance around tool distribution and execution. The results underline the limitations of existing detectors and motivate research that jointly reasons about tool code, natural language descriptions, and semantic consistency to improve security in AI tool ecosystems.