Harmless Tool Chains Jailbreak LLM Agents
Agents
Large Language Model (LLM) agents that call external tools are moving from conversation partners to active system operators. That change brings new security questions. The paper called STAC (Sequential Tool Attack Chaining) shows how small, harmless-looking tool calls can be stitched together across multiple turns to produce a harmful end state that no single prompt would reveal.
How STAC works
The authors build an automated pipeline with five components labelled Generator, Verifier, Prompt Writer, Planner and Judge. The Generator designs chains of two to six tool calls whose final step produces the malicious outcome. The Verifier runs each step in a simulated environment to ensure the chain is executable. The Prompt Writer crafts seemingly innocuous, multi-turn attacker prompts. The Planner uses those prompts to lead the target agent through the chain. The Judge scores prompt harmlessness, attack progress and agent helpfulness after each run.
The team created 483 STAC cases covering 1,352 user–agent–environment interactions, ten distinct failure modes, and domains such as banking, travel and web navigation. They tested a range of agent families, including frontier models such as GPT-4.1. Results are stark: attack success rates exceed 90% for most agents. Prompts score near-zero on harmlessness at the single-prompt level, and refusal rates remain low, which explains why the chain slips past standard defences that inspect individual messages.
Defence findings and limits
Existing prompt-based defences offer limited protection because they evaluate single turns or isolated messages. The authors try two reasoning-driven defence prompts. One asks the agent to weigh harms and benefits of a tool call before execution; the other requires a summary of user intent across the multi-turn history. The best of these reduced attack success by up to 28.8% and raised initial refusal rates. That is useful but far from decisive. Effectiveness drops over successive turns, underlining that prompt fixes alone do not fully address sequence-level threats.
The study has clear bounds. The evaluation uses simulated Python environments from SHADE Arena and Agent Safety Bench and focuses on prompt-based defences. The authors explicitly call for broader guardrails and more realistic deployments for further study. They also document responsible disclosure measures and plan to release data and code to aid reproducibility.
In plain terms, STAC exposes a new class of practical, automated jailbreaks: an attacker can design a harmless-looking recipe and get a capable agent to execute it step by step. Defending against this requires monitoring sequences of actions and reasoning about cumulative effects, not just scanning single prompts.
- Operational takeaways: stress-test agents with sequential chain scenarios; implement sequence-aware monitoring and controls over tool invocation; favour guardrails that reason about cumulative effects rather than single-turn checks.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
🔍 ShortSpan Analysis of the Paper
Problem
As large language models progress to autonomous agents capable of tool use, security risks extend beyond conventional content safety. This paper formalises Sequential Tool Attack Chaining STAC, a multi turn attack that links several tool calls that seem harmless in isolation but together enable harmful operations only visible at the final execution step. The authors automatically generate and systematically evaluate 483 STAC cases comprising 1 352 user agent environment interactions across diverse domains tasks agent types and 10 failure modes. Evaluations show that state of the art LLM agents including GPT 4.1 are highly vulnerable with attack success rates exceeding 90 percent in most cases. The core of STAC is a closed loop pipeline that synthesises executable multi step tool chains validates them in environment execution and reverse engineers stealthy multi turn prompts that reliably induce agents to execute the verified malicious sequence. The paper also analyses defence options and finds that existing prompt based defences offer limited protection. A new reasoning driven defence prompt is proposed that reduces ASR by up to 28.8 percent. The results emphasise a gap: defending tool enabled agents requires reasoning over entire action sequences and their cumulative effects rather than evaluating isolated prompts or responses.
Approach
STAC is implemented as an automated pipeline with five components: Generator Verifier Prompt Writer Planner and Judge. The Generator designs a chain of 2 to 6 target tool calls whose final call achieves a malicious end goal. The Verifier executes each step in the environment observing outputs and revising tool calls as needed to ensure executability. The Prompt Writer creates stealthy attacker prompts that appear benign and logically lead to the initial tool calls. The Planner interactively jailbreaks the agent over multiple turns using a synthetic multi turn context to achieve the end goal. The Judge assesses after each evaluation turn three metrics: Prompt Harmlessness (how innocuous the prompt appears on a seven point scale), Attack Goal Progress (how close the agent is to the end goal on a seven point scale), and Agent Helpfulness (the agent’s willingness to assist on a five point scale). The framework was evaluated on environments from SHADE Arena and Agent Safety Bench covering domains such as banking travel workspace and web navigation. A STAC benchmark of 483 trajectories spanning diverse scenarios and 10 agent specific failure modes was constructed and tested against eight LLM families including open weight and frontier proprietary models. The Generator Verifier Planner and Judge run on GPT 4 1 for most components while the Prompt Writer uses Qwen3 32B due to lower capability requirements. Agents were evaluated for up to three user turns with up to ten agent turns per turn.
Defence evaluation includes existing prompt based baselines plus two novel defence prompts: one that instructs the agent to reason about harms and benefits of a tool call before executing it and another that requires the agent to summarise the user intent over the multi turn history. These prompts target diffused malicious intent rather than single prompts. Metrics and tables in the study compare STAC to a single turn attack and to an X Teaming style baseline adapted from prior work on textual jailbreaks.
Key Findings
- STAC yields high attack success rates, exceeding 90 percent for most agents, including capable models such as GPT 4 1.
- Even agents with strong safeguards are vulnerable because malicious intent is distributed across multiple benign looking steps and only becomes apparent as a full sequence of tool calls is executed.
- STAC demonstrates stealth on average at the prompt level with Prompt Harmlessness scores near zero and low Refusal Rates at the agent level, underscoring the difficulty of detecting multi turn tool based attacks.
- Compared with a single turn attack and with adapted X Teaming methods, STAC achieves substantially higher success rates, indicating that distributing the attack over multiple turns and ensuring tool chain executability are critical for bypassing safety measures.
- A reasoning based defence prompt significantly strengthens initial protection, reducing ASR by up to 28.8 percent and increasing the initial Refusal Rate, though effectiveness diminishes over successive turns, highlighting the limits of prompt based countermeasures.
- Findings indicate that defending tool enabled agents requires evaluating entire action sequences and their cumulative effects rather than assessing isolated prompts or responses.
- Limitations in the study include the use of simulated Python environments and a focus on prompt based defences; the authors call for further work on more robust guardrails and comprehensive evaluations beyond prompt level.
Limitations
The evaluation relies on simulated Python based environments from SHADE Arena and Agent Safety Bench which may not capture the full range of real world deployments. The defence analysis is limited to prompt based mechanisms and does not incorporate model retraining or other guardrail approaches. The authors acknowledge ethical considerations due to the dual use of the attack framework and outline responsible practices including controlled environments and sharing of defensive methods alongside attacks. Reproducibility is supported by appendices containing system prompts and defence prompts and by planned public release of data and code.
Why It Matters
The work reveals a fundamental shift in AI safety as LLMs transition from chatbots to tool using agents. It shows how high level tool orchestration can enable dangerous environmental changes and how standard prompt based defences are insufficient against multi turn chain attacks. The findings encourage stress testing of tool enabled agents with sequential chain scenarios, the implementation of sequence aware monitoring and controls over tool invocation, and stronger prompt guardrails that reason about cumulative effects. While not addressing societal impact directly, the vulnerability has broad security implications for deploying autonomous AI in critical settings and could enable automated misuse or escalation if left unchecked. The authors call for continued research into more robust defence strategies and for integrating security considerations into agent design from the outset.