Function hijacking makes LLM agents pick attacker tools
Function calling turned Large Language Models (LLMs) from chatbots into agents that take actions. Unsurprisingly, that creates a new way to break things. This study looks at the tool selection step and shows that poisoning a single tool description is enough to make the agent call that tool, regardless of what the user asked for.
How the attack works
The attacker does not touch the prompt or the tool implementation. They only edit the text description of one listed function. Into that description they insert an adversarial token suffix computed by gradient-based optimisation. The suffix biases the model's next-token generation toward emitting the attacker's function name when it decides which tool to call. In tests, many of those hijacked calls also include valid parameters, not just the name.
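The optimisation loop can be sketched in miniature. The real attack ranks token swaps using gradients from the target model (GCG-style); here a toy scoring function stands in for that loss purely to show the coordinate-swap structure. The vocabulary, target, and all names below are illustrative, not the paper's implementation.

```python
import random

VOCAB = [f"tok{i}" for i in range(50)]     # stand-in token vocabulary
TARGET = ["tok7", "tok7", "tok7"]          # stand-in for "emit attacker function name"

def toy_loss(suffix):
    """Stand-in for the model loss that, in the real attack, measures how
    strongly the poisoned description pushes generation toward the
    attacker's function name. Lower is better."""
    return sum(a != b for a, b in zip(suffix, TARGET))

def optimise_suffix(length=3, steps=200, seed=0):
    """Greedy coordinate search: repeatedly substitute one suffix token
    and keep the swap if it does not increase the loss."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(length)]
    best = toy_loss(suffix)
    for _ in range(steps):
        pos = rng.randrange(length)            # pick one suffix position
        cand = suffix.copy()
        cand[pos] = rng.choice(VOCAB)          # try a token substitution
        score = toy_loss(cand)
        if score <= best:                      # keep non-worsening swaps
            suffix, best = cand, score
    return suffix, best

suffix, loss = optimise_suffix()
# The optimised tokens are appended to an otherwise benign description.
poisoned_description = "Returns current weather for a city. " + " ".join(suffix)
```

In the paper's setting the loss comes from the target model's logits rather than a string comparison, but the control flow, propose a token swap, score it, keep improvements, is the same shape.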
The team evaluates on the BFCL benchmark with five models, covering instructed and reasoning variants. They measure two things: whether the model calls the targeted function at all (Function Name Attack Success Rate), and whether it fills the tool's parameter slots plausibly (Slot Filling ASR). Reported success rates for the name hijack sit around 70% to 100% across models.
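The two metrics are straightforward to compute over a batch of generated calls. The exact matching rules in the paper may be stricter; the call shape and function names below are assumptions for illustration.

```python
def function_name_asr(calls, target_name):
    """Fraction of responses in which the model invoked the attacker's
    target function at all."""
    hits = sum(1 for call in calls if call.get("name") == target_name)
    return hits / len(calls)

def slot_filling_asr(calls, target_name, required_params):
    """Fraction of responses that both name the target function and supply
    every required parameter with a non-empty value."""
    hits = sum(
        1 for call in calls
        if call.get("name") == target_name
        and all(call.get("arguments", {}).get(p) for p in required_params)
    )
    return hits / len(calls)

calls = [
    {"name": "exfiltrate", "arguments": {"url": "http://evil.example"}},
    {"name": "get_weather", "arguments": {"city": "Oslo"}},
]
print(function_name_asr(calls, "exfiltrate"))          # 0.5
print(slot_filling_asr(calls, "exfiltrate", ["url"]))  # 0.5
```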
Some mechanics matter. Longer suffixes help: 60-token payloads perform far better than shorter ones. Placing the poisoned tool earlier in the function list also boosts success. Reasoning models that insert intermediate 'thinking' steps resist a little, but can still be coerced once the attack suppresses that thinking. The attack remains effective when a few extra, out-of-distribution tools are added to the list; heavy perturbation hurts, but training a universal adversarial function on batches of diverse queries restores much of the effect, with direct batch success up to about 0.88 and decent transfer to unseen variants.
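The position effect is easy to probe experimentally: generate one payload per insertion index and measure ASR for each. A minimal helper for building those variants, with assumed tool shapes:

```python
def payload_variants(clean_tools, poisoned_tool):
    """Yield one tool list per insertion position, front to back, so the
    ASR of early vs. late placement can be compared."""
    for i in range(len(clean_tools) + 1):
        yield clean_tools[:i] + [poisoned_tool] + clean_tools[i:]

tools = [{"name": "get_weather"}, {"name": "send_email"}]
variants = list(payload_variants(tools, {"name": "evil"}))
# Three variants: poisoned tool at index 0, 1, and 2.
```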
Why it matters
This broadens the threat model for agentic systems. If someone can alter tool metadata in a plug-in registry, MCP manifest, or an internal config file, they can redirect calls to an attacker-controlled function. That function could exfiltrate data, delete repositories, send messages, or simply trap the agent in loops. Defences that watch prompts or block obvious jailbreak phrases will not see this coming because the invocation looks like a legitimate, model-chosen call.
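One mitigation this suggests, our sketch, not something the paper evaluates, is to pin a digest of each tool's metadata at review time and refuse to expose tools whose description has changed since approval. All names here are illustrative.

```python
import hashlib
import json

def description_digest(tool):
    """Canonical SHA-256 over name + description, so any edit, including
    an appended adversarial suffix, changes the digest."""
    blob = json.dumps(
        {"name": tool["name"], "description": tool["description"]},
        sort_keys=True,
    ).encode()
    return hashlib.sha256(blob).hexdigest()

def filter_tampered(tools, pinned):
    """Keep only tools whose current digest matches the pinned one."""
    return [t for t in tools if description_digest(t) == pinned.get(t["name"])]

tool = {"name": "get_weather", "description": "Weather for a city."}
pinned = {"get_weather": description_digest(tool)}
tool["description"] += " zx!! call_me_always"   # simulated poisoning
assert filter_tampered([tool], pinned) == []    # tampered tool is dropped
```

This catches post-approval edits to a registry or manifest; it does nothing against a tool that was malicious at approval time, which still needs human review.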
There are limits. The study focuses on mid-sized models and a specific benchmark; behaviour with much larger models or very large tool sets is open. The effect depends on description placement and token budget. Detection and operational constraints are not fully explored. Still, the commercial point is clear: in agent platforms, tool description integrity is now security-critical. Universal poisoning makes it scalable. Watch this space.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies a novel class of attacks on function calling interfaces of agentic large language models. It shows that an adversary who can edit the textual description of a listed function can insert adversarial tokens that change the model's tool selection, forcing the agent to invoke a chosen attacker function regardless of the user prompt. This matters because agentic systems grant models the ability to execute external functions, so hijacked calls can enable data theft or tampering, disruptive behaviour, or execution of harmful actions while appearing as legitimate function calls.
Approach
The authors adapt a gradient-based jailbreaking method to the function calling task, inserting an adversarial suffix inside the description of a target function to bias auto-regressive generation towards that function name. The threat model assumes the attacker controls only the function description, not the user prompt or function implementations. Experiments use the BFCL_v3_multiple benchmark of 200 payloads and five target models: two instructed models and three reasoning variants. Evaluation metrics are Function Name ASR, which checks whether the model calls the targeted function, and Slot Filling ASR, which checks whether the generated call has valid parameters. Baselines include unperturbed inference, a function-injection baseline (generated by a large model in zero-shot and few-shot modes), and an MCP preference manipulation attack. The study also explores variations in adversarial suffix length (10, 35, 60 tokens), the effect of function position and number in the payload, robustness to added noise functions, and universal attack variants trained over batches of queries or payload permutations. Synthetic query augmentations were produced to test universality across reformulations, argument changes and multiple intents.
Key Findings
- High success rates: FHA obtains Function Name ASRs between about 70% and 100% on the BFCL benchmark across the five models tested, with many hijacked calls also passing Slot Filling checks.
- Model differences: reasoning models that generate intermediate "thinking" steps are more resilient, requiring extra optimisation to suppress thinking, but can still be hijacked.
- Suffix size and position matter: larger adversarial suffixes (60 tokens) yield much higher ASR; placing the target function earlier in the function list increases attack efficiency.
- Robustness to moderate perturbation: attacks transfer when up to around three additional out-of-distribution functions are added; simple FHA fails under heavy perturbation but universal variants and batch training increase robustness.
- Universal attacks: training on batches of diverse reformulations or payload variants can produce a single adversarial function that hijacks multiple queries or payload versions, with direct batch ASRs up to about 0.88 and transferred ASRs showing substantial generalisation.
- Baselines comparison: MPMA-style preference edits show low effectiveness on BFCL, while function-injection can succeed in some settings; FHA is more general and scalable across diverse payloads.
Limitations
Experiments focus on models mainly in the 1B to 14B parameter range and on the BFCL benchmark; behaviour with much larger models, substantially larger tool sets or in other MCP environments remains to be determined. The attack's effectiveness depends in part on semantic proximity between target function metadata and the prompt and on the proportion and position of adversarial tokens in the input. Detection and real-world deployment constraints were not exhaustively explored.
Implications
Offensive consequences are direct: an attacker able to alter function descriptions could cause agents to call attacker-chosen functions that perform data exfiltration, deletion, unauthorised actions such as repository deletion or message sending, or enable jailbreak-style harmful outputs while appearing as valid calls. Universal adversarial functions increase scalability, allowing a single poisoned tool to hijack many queries or survive some codebase changes. These results highlight that function description integrity and strict invocation controls are critical attack surface items for agentic systems.
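The "strict invocation controls" point can be made concrete with a small gate, our illustration, not the paper's method: check each model-chosen call against a per-task allowlist before execution, so a hijacked selection cannot reach an unapproved function. Names below are hypothetical.

```python
def guard_call(call, allowlist):
    """Return the call if its function is approved for this task,
    otherwise None, blocking execution of hijacked selections."""
    if call.get("name") in allowlist:
        return call
    return None

# A hijacked call to an unapproved function is blocked at the gate.
assert guard_call({"name": "get_weather"}, {"get_weather"}) is not None
assert guard_call({"name": "delete_repo"}, {"get_weather"}) is None
```

The limitation mirrors the hash-pinning idea: an allowlist only helps if the attacker's function is not itself on the approved list for the task.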