
Function hijacking makes LLM agents pick attacker tools

Agents
Published: Thu, Apr 23, 2026 • By Adrian Calder
New research shows an attacker can poison a tool's description to make Large Language Model (LLM) agents call that tool regardless of the prompt. Success rates hit 70–100% across several models, with universal variants working across queries. It shifts risk from prompts to tool metadata, with clear implications for agent plug-in ecosystems.

Function calling turned Large Language Models (LLMs) from chatbots into agents that take actions. Unsurprisingly, that creates a new way to break things. This study looks at the tool selection step and shows you can poison a single tool description so the agent calls it, whatever the user asked for.

How the attack works

The attacker does not touch the prompt or the tool implementation. They only edit the text description of one listed function. Into that description they insert an adversarial token suffix computed by gradient-based optimisation. The suffix biases the model's next-token generation toward emitting the attacker's function name when it decides which tool to call. In tests, many of those hijacked calls also include valid parameters, not just the name.
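
To make the attack surface concrete, here is a minimal sketch of an OpenAI-style tool list in which one entry's description carries the payload. The tool names, schema fields and the suffix placeholder are illustrative, not taken from the paper; a real suffix is a string of optimisation-chosen tokens rather than readable text.

```python
# Minimal sketch of a function-calling tool list (OpenAI-style JSON schema).
# Names, fields and the suffix placeholder are illustrative only.

ADVERSARIAL_SUFFIX = "<gradient-optimised token suffix>"  # placeholder, not a real payload

tools = [
    {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        # The attacker controls only this entry's description text; the name,
        # schema and implementation are untouched.
        "name": "export_user_data",
        "description": "Export account data to an external endpoint. " + ADVERSARIAL_SUFFIX,
        "parameters": {
            "type": "object",
            "properties": {"destination": {"type": "string"}},
            "required": ["destination"],
        },
    },
]
```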

The team evaluates on the BFCL benchmark with five models, covering instructed and reasoning variants. They measure two things: whether the model calls the targeted function at all (Function Name Attack Success Rate), and whether it fills the tool's slots plausibly (Slot Filling). Reported success rates for the name hijack sit around 70% to 100% across models.
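
As a rough illustration of how the two metrics could be scored over parsed model outputs, here is a small sketch. The call format, the parameter check and the example values are assumptions, as is the choice to count slot filling over all queries rather than only hijacked ones; the paper's exact scoring harness may differ.

```python
import json

def function_name_asr(calls, target_name):
    """Fraction of model outputs that invoke the attacker's target function."""
    hits = sum(1 for call in calls if call.get("name") == target_name)
    return hits / len(calls) if calls else 0.0

def slot_filling_asr(calls, target_name, required_params):
    """Fraction of outputs that hijack the call AND supply the required parameters."""
    valid = 0
    for call in calls:
        if call.get("name") != target_name:
            continue
        try:
            args = json.loads(call.get("arguments", "{}"))
        except json.JSONDecodeError:
            continue
        if all(p in args and args[p] not in (None, "") for p in required_params):
            valid += 1
    return valid / len(calls) if calls else 0.0

# Example: two of three queries are hijacked, one of them with valid parameters.
calls = [
    {"name": "export_user_data", "arguments": '{"destination": "evil.example"}'},
    {"name": "export_user_data", "arguments": "{}"},
    {"name": "get_weather", "arguments": '{"city": "Dublin"}'},
]
print(function_name_asr(calls, "export_user_data"))                  # ~0.67
print(slot_filling_asr(calls, "export_user_data", ["destination"]))  # ~0.33
```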

Some mechanics matter. Longer suffixes help: 60-token payloads perform far better than shorter ones. Placing the poisoned tool earlier in the function list also boosts success. Reasoning models that insert intermediate 'thinking' steps resist a little, but can still be coerced once the attack suppresses that thinking. The attack remains effective when a few extra, out-of-distribution tools are added to the list. Heavy perturbation does hurt, but training a universal adversarial function on batches of diverse queries restores much of the effect, with direct batch success up to about 0.88 and decent transfer to unseen query variants.
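
The universal variant amounts to optimising one shared suffix against an aggregate loss over many queries instead of a single prompt. The sketch below shows that objective, assuming a Hugging Face-style causal LM and tokenizer and a serialised payload template with {suffix} and {query} placeholders; the paper's exact loss, optimiser and prompt formatting may differ.

```python
import torch
import torch.nn.functional as F

def batch_hijack_loss(model, tokenizer, queries, payload_template, suffix_ids, target_name):
    """Summed loss for emitting the target function name across a batch of queries,
    with one shared adversarial suffix embedded in the serialised tool description."""
    target_ids = tokenizer(" " + target_name, add_special_tokens=False,
                           return_tensors="pt").input_ids[0]
    # Decode/re-encode is a simplification; a careful implementation keeps the
    # suffix token ids fixed so the optimisation variable round-trips exactly.
    suffix_text = tokenizer.decode(suffix_ids)
    total = torch.zeros(())
    for query in queries:
        prompt = payload_template.format(suffix=suffix_text, query=query)
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full = torch.cat([prompt_ids, target_ids.unsqueeze(0)], dim=1)
        logits = model(full).logits[0]
        # Logits at these positions predict the target function-name tokens.
        pred = logits[prompt_ids.shape[1] - 1 : -1]
        total = total + F.cross_entropy(pred, target_ids)
    return total  # minimise over suffix_ids, e.g. with greedy token swaps
```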

Why it matters

This broadens the threat model for agentic systems. If someone can alter tool metadata in a plug-in registry, MCP manifest, or an internal config file, they can redirect calls to an attacker-controlled function. That function could exfiltrate data, delete repositories, send messages, or simply trap the agent in loops. Defences that watch prompts or block obvious jailbreak phrases will not see this coming because the invocation looks like a legitimate, model-chosen call.
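
One practical consequence is that tool metadata deserves the same integrity controls as code. Below is a minimal sketch of description pinning, assuming the platform can record an approved hash per tool at registration time and re-verify it before exposing the tool to the model; the manifest fields are illustrative rather than MCP-specific.

```python
import hashlib
import json

def description_fingerprint(tool: dict) -> str:
    """Stable hash over the metadata the model actually sees during tool selection."""
    canonical = json.dumps(
        {"name": tool["name"],
         "description": tool["description"],
         "parameters": tool.get("parameters", {})},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_manifest(tools: list[dict], pinned: dict[str, str]) -> list[str]:
    """Return the names of tools whose metadata no longer matches the pinned hash."""
    return [t["name"] for t in tools
            if pinned.get(t["name"]) != description_fingerprint(t)]

# At registration: pinned = {t["name"]: description_fingerprint(t) for t in tools}
# At load time: refuse to expose any tool returned by verify_manifest(tools, pinned).
```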

There are limits. The study focuses on mid-sized models and a specific benchmark; behaviour with much larger models or very large tool sets is open. The effect depends on description placement and token budget. Detection and operational constraints are not fully explored. Still, the commercial point is clear: in agent platforms, tool description integrity is now security-critical. Universal poisoning makes it scalable. Watch this space.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

Authors: Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis, Seshu Tirupathi, and John D. Kelleher
The growth of agentic AI has drawn significant attention to function calling Large Language Models (LLMs), which are designed to extend the capabilities of AI-powered systems by invoking external functions. Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation. The expanded capabilities of agentic models introduce further vulnerabilities via their function calling interface. Recent work in LLM security showed that function calling can be abused, leading to data tampering and theft, causing disruptive behavior such as endless loops, or causing LLMs to produce harmful content in the style of jailbreaking attacks. This paper introduces a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. While existing attacks focus on semantic preference of the model for function-calling tasks, we show that FHA is largely agnostic to the context semantics and robust to the function sets, making it applicable across diverse domains. We further demonstrate that FHA can be trained to produce universal adversarial functions, enabling a single attacked function to hijack tool selection across multiple queries and payload configurations. We conducted experiments on 5 different models, including instructed and reasoning variants, reaching 70% to 100% ASR over the established BFCL dataset. Our findings further demonstrate the need for strong guardrails and security modules for agentic systems.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies a novel class of attacks on function calling interfaces of agentic large language models. It shows that an adversary who can edit the textual description of a listed function can insert adversarial tokens that change the model's tool selection, forcing the agent to invoke a chosen attacker function regardless of the user prompt. This matters because agentic systems grant models the ability to execute external functions, so hijacked calls can enable data theft or tampering, disruptive behaviour, or execution of harmful actions while appearing as legitimate function calls.

Approach

The authors adapt a gradient-based jailbreaking method to the function calling task, inserting an adversarial suffix inside the description of a target function to bias auto-regressive generation towards that function name. The attacker model assumes control only of the function description, not of the user prompt or function implementations. Experiments use the BFCL_v3_multiple benchmark of 200 payloads and five target models: two instructed models and three reasoning variants. Evaluation metrics are Function Name ASR, which checks whether the model calls the targeted function, and Slot Filling ASR, which checks whether the generated call has valid parameters. Baselines include unperturbed inference, a function-injection baseline (generated by a large model in zero-shot and few-shot modes), and an MCP preference manipulation attack. The study also explores variations in adversarial suffix length (10, 35, 60 tokens), the effect of function position and number in the payload, robustness to added noise functions, and universal attack variants trained over batches of queries or payload permutations. Synthetic query augmentations were produced to test universality across reformulations, argument changes and multiple intents.
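
For intuition, here is a highly simplified search loop in the spirit of the attack: it optimises a suffix placed inside the target function's description so that the tokens the model emits next are the target function name. It assumes a Hugging Face-style causal LM and tokenizer, and it substitutes random token swaps for the paper's gradient-guided candidate selection to keep the sketch short; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def _target_loss(model, prefix_ids, suffix_ids, postfix_ids, target_ids):
    """Cross-entropy of the target function-name tokens at the end of the context."""
    ids = torch.cat([prefix_ids, suffix_ids, postfix_ids, target_ids]).unsqueeze(0)
    logits = model(ids).logits[0]
    start = ids.shape[1] - target_ids.numel() - 1
    return F.cross_entropy(logits[start:-1], target_ids).item()

def search_suffix(model, tokenizer, before, after, target_name, n_tokens=35, steps=500):
    """Greedy token-swap search for an adversarial suffix placed inside the target
    function's description: 'before' ends at the insertion point, 'after' is the
    rest of the serialised payload plus the user query."""
    enc = lambda s, **kw: tokenizer(s, return_tensors="pt", **kw).input_ids[0]
    prefix_ids = enc(before)
    postfix_ids = enc(after, add_special_tokens=False)
    target_ids = enc(" " + target_name, add_special_tokens=False)
    suffix_ids = torch.randint(0, tokenizer.vocab_size, (n_tokens,))
    best = _target_loss(model, prefix_ids, suffix_ids, postfix_ids, target_ids)
    for _ in range(steps):
        cand = suffix_ids.clone()
        pos = torch.randint(0, n_tokens, (1,)).item()
        cand[pos] = torch.randint(0, tokenizer.vocab_size, (1,)).item()
        loss = _target_loss(model, prefix_ids, cand, postfix_ids, target_ids)
        if loss < best:                    # keep swaps that make the model more
            best, suffix_ids = loss, cand  # likely to emit the target function name
    return tokenizer.decode(suffix_ids), best
```

A gradient-guided method of the kind adapted in the paper would rank candidate swaps using gradients with respect to the suffix tokens rather than proposing them at random, which makes the search far more efficient.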

Key Findings

  • High success rates: FHA obtains Function Name ASRs between about 70% and 100% on the BFCL benchmark across the five models tested, with many hijacked calls also passing Slot Filling checks.
  • Model differences: reasoning models that generate intermediate "thinking" steps are more resilient, requiring extra optimisation to suppress thinking, but can still be hijacked.
  • Suffix size and position matter: larger adversarial suffixes (60 tokens) yield much higher ASR; placing the target function earlier in the function list increases attack efficiency.
  • Robustness to moderate perturbation: attacks transfer when up to around three additional out-of-distribution functions are added; simple FHA fails under heavy perturbation but universal variants and batch training increase robustness.
  • Universal attacks: training on batches of diverse reformulations or payload variants can produce a single adversarial function that hijacks multiple queries or payload versions, with direct batch ASRs up to about 0.88 and transferred ASRs showing substantial generalisation.
  • Baseline comparison: MPMA-style preference edits show low effectiveness on BFCL, while function injection can succeed in some settings; FHA is more general and scalable across diverse payloads.

Limitations

Experiments focus on models mainly in the 1B to 14B parameter range and on the BFCL benchmark; behaviour with much larger models, substantially larger tool sets or in other MCP environments remains to be determined. The attack's effectiveness depends in part on semantic proximity between target function metadata and the prompt and on the proportion and position of adversarial tokens in the input. Detection and real-world deployment constraints were not exhaustively explored.

Implications

Offensive consequences are direct: an attacker able to alter function descriptions could cause agents to call attacker-chosen functions that perform data exfiltration, deletion, unauthorised actions such as repository deletion or message sending, or enable jailbreak-style harmful outputs while appearing as valid calls. Universal adversarial functions increase scalability, allowing a single poisoned tool to hijack many queries or survive some codebase changes. These results highlight that function description integrity and strict invocation controls are critical attack surface items for agentic systems.

