TRACE automates agentic jailbreaking via staged tasks

Published: Mon, Jun 01, 2026 • By James Armitage
TRACE shows how attackers steer autonomous LLM agents by splitting malicious goals into subtasks and hiding the dangerous steps inside plausible roles and environments. A feedback loop evolves the disguise until execution succeeds. In benchmarks, TRACE hits up to 100% bypass and lifts task success, with controlled cyberattack demos.

Most teams still treat agent safety like a fancy content filter. Block the obvious bad strings, sprinkle in a refusal template, job done. TRACE puts a hole through that comfort. It shows that if you let a Large Language Model (LLM) act as an agent that plans, calls tools and updates state, a patient attacker can walk it to harm by staging the task, not the prompt.

How TRACE works

The framework starts by decomposing a malicious goal into multiple candidate subtask sequences, then picks the path with the fewest obviously harmful steps. Harmless bits go through unchanged. Risky steps get wrapped in a task-aware scenario made of four parts: role, environment, directive and heuristic. The trick is to make each dangerous action look like a sensible move within a legitimate context, such as a constrained lab role or a diagnostic setting.
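The selection step can be pictured as a plain minimisation. The sketch below is illustrative only: the `Subtask` type, the `flagged` field and the tie-break on sequence length are assumptions made here, not details from the paper, and it leaves out the semantic consistency check the paper also applies.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Subtask:
    label: str     # placeholder name for the step
    flagged: bool  # True if a (hypothetical) classifier marks it explicitly harmful


def pick_sequence(candidates: Sequence[List[Subtask]]) -> List[Subtask]:
    """Return the candidate with the fewest flagged steps.

    Ties are broken by shorter length, which is an assumption of this sketch.
    """
    return min(candidates, key=lambda seq: (sum(s.flagged for s in seq), len(seq)))


candidates = [
    [Subtask("a1", False), Subtask("a2", True), Subtask("a3", True)],
    [Subtask("b1", False), Subtask("b2", False), Subtask("b3", True), Subtask("b4", False)],
]
chosen = pick_sequence(candidates)
print([s.label for s in chosen])
```

The count being minimised is roughly what a per-step content filter gets to see, which is why driving it down weakens per-step filtering.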

TRACE then iteratively evolves those scenario components. A Q-learning-inspired mechanism samples transformation actions to tweak the role, environment, directive or heuristic, guided by execution feedback that rewards real task progress and penalises refusals. A memory module keeps scenarios and component variants that worked before, so later attempts start stronger. The attacker model is black box: the attacker only interacts with the agent interface and observes execution traces. In the implementation described, strong LLMs act as internal judges and generators.

On two agent security benchmarks, AgentHarm and AdvCUA, TRACE consistently beats representative jailbreak baselines across multiple advanced LLM agents. The paper reports up to a 100 percent bypass rate and a 0.73 average success score on AgentHarm, and an average success score of 0.50 on AdvCUA (against 0.27 for the strongest baseline), while baselines often stumble after early refusals. Controlled case studies include guiding an agent through forensic and sandbox scenarios to end with a remote exploit for a stack corruption task, and splitting a common-modulus cryptanalysis into data collection, local recovery and verification.

Why this matters

The honest read is blunt: this is not a prompt problem, it is a workflow problem. If your system lets an agent orchestrate plans and tools, you have built an execution engine that can be steered by disguising intent over time. Filters that scan for explicit harm miss staged plans that look like maintenance, audit or research. Yes, the experiments are sandboxed and the defences tested are not exhaustive. But the pattern is what counts. TRACE’s memory reuse grows during optimisation, meaning the attack learns what works and reuses it. In other words, this gets easier with practice.

My assessment: in organisations that deploy autonomous agents without tight control of plan provenance, state transitions and tool invocation, safety alignment is theatre. If an agent can do valuable work, it can be induced to do the same work for the wrong reason. Ship that with your eyes open.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

TRACE: Task-Aware Adaptive Self-Evolving Agentic Jailbreaking

Authors: Churui Zeng, Weiwei Qi, Kedong Xiu, Tianhang Zheng, Chaochao Lu, Liang He, Zhan Qin, and Kui Ren
The rise of LLM agents introduces a new threat by enabling planning, coding, and even end-to-end execution of expert-level attack workflows. However, this threat remains underexplored and underestimated since (i) safety alignment prevents LLMs from directly generating harmful instructions, and (ii) most existing jailbreak methods cannot consistently induce agents to execute malicious operations. In this paper, we propose TRACE, a practical agentic jailbreaking framework to further reveal the risks of this threat surface. To conceal the malicious intent, TRACE decomposes a malicious task into multiple subtask sequences under different schemes and selects the sequence with the fewest explicitly harmful subtasks. TRACE then disguises the remaining harmful subtasks as benign-looking instructions by embedding them in task-aware scenarios with related roles, environments, directives, and heuristics. The scenarios are iteratively evolved through well-defined transformation actions, which are sampled by a Q-learning-inspired mechanism, for inducing the agent to execute on the harmful subtasks. Extensive evaluations on AgentHarm and AdvCUA show that TRACE consistently outperforms existing jailbreak baselines across multiple advanced LLM agents, achieving up to 100% bypass rate and 0.73 average success score. We also demonstrate the effectiveness of TRACE in controlled cyberattack instances. Our code and demos are available at https://github.com/ZJU-LLM-Safety/TRACE.git.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies how autonomous LLM agents can be induced to plan and execute multi-step harmful workflows despite model safety alignment. It argues that one-shot jailbreak prompts are insufficient in agentic settings because preserving adversarial intent requires sustained coordination across planning, tool use, execution monitoring and state updates. The work exposes a realistic threat surface where a black-box adversary can interact with an agent interface to covertly convert high-level malicious goals into executable attacks.

Approach

TRACE is a three-stage, task-aware, adaptive framework for agentic jailbreaking. Stage one decomposes a harmful goal into multiple candidate subtask sequences and selects a semantically consistent sequence with the fewest explicitly harmful subtasks. Stage two executes harmless subtasks directly and converts rejected or risky subtasks into task-aware disguising scenarios composed of four components: role, environment, directive and heuristic. Stage three evolves these scenario components through a feedback-driven search that samples component-level transformation actions from a transition matrix and updates it using a Q-learning-inspired rule. A memory module retains successful scenarios and component variants for reuse. Execution feedback scores balance subtask progress and refusal behaviour to guide evolution. TRACE conditions initial component pools on available tools and uses strong LLMs as internal judges and component generators in the implementation described.
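To make the stage-three update concrete, here is a minimal sketch of a textbook tabular Q-learning update over a small table. It is not the paper's implementation. The four component names come from the paper; everything else is an assumption for illustration: the placeholder action labels `A0` to `A2`, treating the component being edited as the state, the epsilon-greedy sampling, the shape of `feedback_score`, and the `alpha` and `gamma` values.

```python
import random
from typing import Dict

COMPONENTS = ["role", "environment", "directive", "heuristic"]  # named in the paper
ACTIONS = ["A0", "A1", "A2"]  # placeholder labels; the real action set is not given here

Table = Dict[str, Dict[str, float]]


def make_table(init: float = 0.0) -> Table:
    """One row per component, one column per placeholder action."""
    return {c: {a: init for a in ACTIONS} for c in COMPONENTS}


def feedback_score(progress: float, refused: bool, refusal_penalty: float = 1.0) -> float:
    """Assumed reward: progress in [0, 1] minus a fixed penalty on refusal."""
    return progress - (refusal_penalty if refused else 0.0)


def sample_action(row: Dict[str, float], epsilon: float, rng: random.Random) -> str:
    """Epsilon-greedy choice over one row (an assumed sampling scheme)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: row[a])


def q_update(table: Table, state: str, action: str, reward: float,
             next_state: str, alpha: float = 0.5, gamma: float = 0.9) -> float:
    """Standard tabular Q-learning update; returns the new value."""
    best_next = max(table[next_state].values())
    table[state][action] += alpha * (reward + gamma * best_next - table[state][action])
    return table[state][action]


table = make_table()
first = q_update(table, "role", "A1", feedback_score(1.0, refused=False), "role")
second = q_update(table, "role", "A0", feedback_score(0.0, refused=True), "role")
print(first, second, sample_action(table["role"], 0.0, random.Random(0)))
```

In this toy run the action that produced progress gains value and the one that triggered a refusal loses it, so greedy sampling then prefers the former. That is the sense in which the search learns from execution feedback.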

Key Findings

  • TRACE substantially outperforms representative jailbreak baselines on two agent security benchmarks. Across AgentHarm and AdvCUA and three advanced backbone models, it achieves the highest average success score (ASS) and bypass rate. Over the strongest baselines, ASS rises from 0.59 to 0.73 on AgentHarm and from 0.27 to 0.50 on AdvCUA, and the bypass rate reaches up to 100 percent.
  • Existing methods can achieve non-trivial bypass rates but often fail to sustain end-to-end malicious execution. TRACE preserves adversarial intent more reliably through semantic consistency checks and execution feedback, yielding higher task completion quality as thresholds rise.
  • TRACE is effective in controlled cyberattack instances. Case studies include a stack corruption task where the agent is guided through forensic and sandbox scenarios to produce a remote exploit, and a common-modulus cryptanalysis task decomposed into data collection, local recovery and verification subtasks.
  • Self-evolution increases efficiency: memory reuse of prior successful scenarios grows during optimisation and often exceeds 0.5 by the end of evolution, indicating accumulated patterns are exploited to improve subsequent attacks.
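For a sense of scale, the ASS figures quoted in the first finding can be turned into absolute and relative gains. The percentages below are derived here from those reported numbers; they are not stated in the paper, and the helper name `improvement` is made up for this example.

```python
from typing import Tuple


def improvement(baseline: float, new: float) -> Tuple[float, float]:
    """Return (absolute gain, gain relative to the baseline)."""
    absolute = new - baseline
    return absolute, absolute / baseline


for name, base, new in [("AgentHarm", 0.59, 0.73), ("AdvCUA", 0.27, 0.50)]:
    absolute, relative = improvement(base, new)
    print(f"{name}: +{absolute:.2f} ASS ({relative:.0%} relative)")
```

That works out to roughly a 24 percent relative gain on AgentHarm and about 85 percent on AdvCUA, so the gap over the strongest baseline is proportionally much wider on the benchmark where baselines score lowest.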

Limitations

Experiments assume a black-box adversary that can query the agent and observe traces but cannot modify tools or environments. All evaluations were performed in controlled, sandboxed settings and not against real-world deployed systems. The implementation relies on strong LLMs as judges and component generators and on fixed hyperparameters, which may limit generality. Defences evaluated are representative and not exhaustive. The memory admission policy and refusal detector are strict and rely on heuristics.

Implications

From an offensive-security viewpoint, TRACE demonstrates that an attacker can covertly induce advanced agents to carry out multi-step attacks by fragmenting objectives, disguising dangerous subtasks within plausible operational scenarios and iteratively adapting those scenarios based on execution feedback. Attackers can scale and improve efficacy over time by reusing successful scenarios and component patterns. Practical exploit workflows shown include vulnerability analysis, exploit payload construction and cryptanalysis, highlighting the urgent need to track composite task origins, detect staged or evolving plans and harden planning and tool-invocation pipelines in agentic systems.
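On the defensive side, one way to picture "tracking composite task origins" is a monitor that scores risk per originating goal rather than per step. This is a rough sketch under stated assumptions, not something the paper proposes: `SessionMonitor`, `StepRecord`, the `goal_id` link, the decay factor and the threshold are all hypothetical, and the per-step risk score is assumed to come from some upstream classifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StepRecord:
    goal_id: str  # hypothetical identifier tying a step back to its originating goal
    tool: str     # name of the tool the agent invoked
    risk: float   # per-step risk in [0, 1] from an assumed upstream classifier


@dataclass
class SessionMonitor:
    threshold: float = 1.0  # alert once accumulated risk reaches this value
    decay: float = 0.9      # how much earlier risk still counts at each new step
    records: List[StepRecord] = field(default_factory=list)
    cumulative: Dict[str, float] = field(default_factory=dict)

    def observe(self, goal_id: str, tool: str, risk: float) -> bool:
        """Record one tool call; return True if the goal's accumulated risk trips the threshold."""
        total = self.cumulative.get(goal_id, 0.0) * self.decay + risk
        self.cumulative[goal_id] = total
        self.records.append(StepRecord(goal_id, tool, risk))
        return total >= self.threshold


monitor = SessionMonitor(threshold=1.0, decay=1.0)
alerts = [monitor.observe("goal-1", tool, 0.4)
          for tool in ("read_file", "run_script", "send_request")]
print(alerts)
```

No single step here would trip a per-step filter set at 1.0, yet the third call trips the accumulated check. The hard part in practice is the `goal_id` link itself: a staged attack is built so that its subtasks do not obviously share an origin.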

