Harmless Tool Chains Jailbreak LLM Agents

Agents

Published: Wed, Oct 01, 2025 • By Dr. Marcus Halden

Researchers present STAC (Sequential Tool Attack Chaining), an automated method that links benign-looking tool calls into multi-turn attacks against tool-enabled Large Language Models (LLMs). They test 483 chained cases and 1,352 interactions, finding attack success rates above 90% for most agents. A reasoning-driven defence helps but does not eliminate the risk.

Large Language Model (LLM) agents that call external tools are moving from conversation partners to active system operators. That change brings new security questions. The paper called STAC (Sequential Tool Attack Chaining) shows how small, harmless-looking tool calls can be stitched together across multiple turns to produce a harmful end state that no single prompt would reveal.

How STAC works

The authors build an automated pipeline with five components labelled Generator, Verifier, Prompt Writer, Planner and Judge. The Generator designs chains of two to six tool calls whose final step produces the malicious outcome. The Verifier runs each step in a simulated environment to ensure the chain is executable. The Prompt Writer crafts seemingly innocuous, multi-turn attacker prompts. The Planner uses those prompts to lead the target agent through the chain. The Judge scores prompt harmlessness, attack progress and agent helpfulness after each run.

The team created 483 STAC cases covering 1,352 user–agent–environment interactions, ten distinct failure modes, and domains such as banking, travel and web navigation. They tested a range of agent families, including frontier models such as GPT-4.1. Results are stark: attack success rates exceed 90% for most agents. Prompts score near-zero on harmlessness at the single-prompt level, and refusal rates remain low, which explains why the chain slips past standard defences that inspect individual messages.

Defence findings and limits

Existing prompt-based defences offer limited protection because they evaluate single turns or isolated messages. The authors try two reasoning-driven defence prompts. One asks the agent to weigh harms and benefits of a tool call before execution; the other requires a summary of user intent across the multi-turn history. The best of these reduced attack success by up to 28.8% and raised initial refusal rates. That is useful but far from decisive. Effectiveness drops over successive turns, underlining that prompt fixes alone do not fully address sequence-level threats.

The study has clear bounds. The evaluation uses simulated Python environments from SHADE Arena and Agent Safety Bench and focuses on prompt-based defences. The authors explicitly call for broader guardrails and more realistic deployments for further study. They also document responsible disclosure measures and plan to release data and code to aid reproducibility.

In plain terms, STAC exposes a new class of practical, automated jailbreaks: an attacker can design a harmless-looking recipe and get a capable agent to execute it step by step. Defending against this requires monitoring sequences of actions and reasoning about cumulative effects, not just scanning single prompts.

Operational takeaways: stress-test agents with sequential chain scenarios; implement sequence-aware monitoring and controls over tool invocation; favour guardrails that reason about cumulative effects rather than single-turn checks.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Authors: Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, and Yanjun Qi

As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC's automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt-based defenses provide limited protection. To address this gap, we propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.

🔍 ShortSpan Analysis of the Paper

Problem

As large language models progress to autonomous agents capable of tool use, security risks extend beyond conventional content safety. This paper formalises Sequential Tool Attack Chaining STAC, a multi turn attack that links several tool calls that seem harmless in isolation but together enable harmful operations only visible at the final execution step. The authors automatically generate and systematically evaluate 483 STAC cases comprising 1 352 user agent environment interactions across diverse domains tasks agent types and 10 failure modes. Evaluations show that state of the art LLM agents including GPT 4.1 are highly vulnerable with attack success rates exceeding 90 percent in most cases. The core of STAC is a closed loop pipeline that synthesises executable multi step tool chains validates them in environment execution and reverse engineers stealthy multi turn prompts that reliably induce agents to execute the verified malicious sequence. The paper also analyses defence options and finds that existing prompt based defences offer limited protection. A new reasoning driven defence prompt is proposed that reduces ASR by up to 28.8 percent. The results emphasise a gap: defending tool enabled agents requires reasoning over entire action sequences and their cumulative effects rather than evaluating isolated prompts or responses.

Approach

STAC is implemented as an automated pipeline with five components: Generator Verifier Prompt Writer Planner and Judge. The Generator designs a chain of 2 to 6 target tool calls whose final call achieves a malicious end goal. The Verifier executes each step in the environment observing outputs and revising tool calls as needed to ensure executability. The Prompt Writer creates stealthy attacker prompts that appear benign and logically lead to the initial tool calls. The Planner interactively jailbreaks the agent over multiple turns using a synthetic multi turn context to achieve the end goal. The Judge assesses after each evaluation turn three metrics: Prompt Harmlessness (how innocuous the prompt appears on a seven point scale), Attack Goal Progress (how close the agent is to the end goal on a seven point scale), and Agent Helpfulness (the agent’s willingness to assist on a five point scale). The framework was evaluated on environments from SHADE Arena and Agent Safety Bench covering domains such as banking travel workspace and web navigation. A STAC benchmark of 483 trajectories spanning diverse scenarios and 10 agent specific failure modes was constructed and tested against eight LLM families including open weight and frontier proprietary models. The Generator Verifier Planner and Judge run on GPT 4 1 for most components while the Prompt Writer uses Qwen3 32B due to lower capability requirements. Agents were evaluated for up to three user turns with up to ten agent turns per turn.

Defence evaluation includes existing prompt based baselines plus two novel defence prompts: one that instructs the agent to reason about harms and benefits of a tool call before executing it and another that requires the agent to summarise the user intent over the multi turn history. These prompts target diffused malicious intent rather than single prompts. Metrics and tables in the study compare STAC to a single turn attack and to an X Teaming style baseline adapted from prior work on textual jailbreaks.

Key Findings

STAC yields high attack success rates, exceeding 90 percent for most agents, including capable models such as GPT 4 1.
Even agents with strong safeguards are vulnerable because malicious intent is distributed across multiple benign looking steps and only becomes apparent as a full sequence of tool calls is executed.
STAC demonstrates stealth on average at the prompt level with Prompt Harmlessness scores near zero and low Refusal Rates at the agent level, underscoring the difficulty of detecting multi turn tool based attacks.
Compared with a single turn attack and with adapted X Teaming methods, STAC achieves substantially higher success rates, indicating that distributing the attack over multiple turns and ensuring tool chain executability are critical for bypassing safety measures.
A reasoning based defence prompt significantly strengthens initial protection, reducing ASR by up to 28.8 percent and increasing the initial Refusal Rate, though effectiveness diminishes over successive turns, highlighting the limits of prompt based countermeasures.
Findings indicate that defending tool enabled agents requires evaluating entire action sequences and their cumulative effects rather than assessing isolated prompts or responses.
Limitations in the study include the use of simulated Python environments and a focus on prompt based defences; the authors call for further work on more robust guardrails and comprehensive evaluations beyond prompt level.

Limitations

The evaluation relies on simulated Python based environments from SHADE Arena and Agent Safety Bench which may not capture the full range of real world deployments. The defence analysis is limited to prompt based mechanisms and does not incorporate model retraining or other guardrail approaches. The authors acknowledge ethical considerations due to the dual use of the attack framework and outline responsible practices including controlled environments and sharing of defensive methods alongside attacks. Reproducibility is supported by appendices containing system prompts and defence prompts and by planned public release of data and code.

Why It Matters

The work reveals a fundamental shift in AI safety as LLMs transition from chatbots to tool using agents. It shows how high level tool orchestration can enable dangerous environmental changes and how standard prompt based defences are insufficient against multi turn chain attacks. The findings encourage stress testing of tool enabled agents with sequential chain scenarios, the implementation of sequence aware monitoring and controls over tool invocation, and stronger prompt guardrails that reason about cumulative effects. While not addressing societal impact directly, the vulnerability has broad security implications for deploying autonomous AI in critical settings and could enable automated misuse or escalation if left unchecked. The authors call for continued research into more robust defence strategies and for integrating security considerations into agent design from the outset.

Attribution Original paper on arXiv