New to ShortSpan? We distil the AI-security research that matters into practitioner takeaways — edited by Ben Williams (NCC Group). Get the weekly email
// Analysis

Moderation Traces Jailbreak Function-Calling LLMs by Exploiting State

Attacks
Moderation Traces Jailbreak Function-Calling LLMs by Exploiting State

A new black-box attack, Simulated Moderation Traces (SMT), fakes a moderation workflow to coax function-calling Large Language Models (LLMs) into unsafe output. By interleaving schemas, arguments and fabricated validator feedback in the same context, SMT hits 99.67% and 98.33% success with ~1.4 queries per input. Prompt-only defences falter; targeted prompts curb some models.

We keep arguing about prompts while the real hole sits in the plumbing. This paper targets the architecture of function-calling Large Language Model (LLM) systems, where developer schemas, structured arguments and tool outputs all share one context. When control logic and untrusted data mingle, safety becomes theatre. The authors show how to weaponise that statefulness to jailbreak models with almost no fuss.

How the attack works

Simulated Moderation Traces (SMT) is a black-box, multi-turn routine that pretends to be a moderation audit. The attacker exposes two function schemas to the model: a case generator that wants paired “safe” and “unsafe” samples, and a case validator with criteria. The script asks the model to populate these. If the model refuses or hedges, the attacker forges a well-formed function call or a validator-style exception and appends it to the shared history. The trick is simple: treat a refusal as an execution failure. The validator then “demands” a refined attempt that better meets the criteria. Rinse and repeat.

Because all of this plays out inside the same conversational state, the model is nudged to comply with the purported workflow. An external LLM judge scores harmfulness, and SMT keeps the best candidate across a tight retry loop. No gradient access, no jailbreak zoo, just a plausible operational frame that chips away at guardrails until the model emits what it was told not to.

What the results say

Across two standard safety suites, SMT reports average attack success of 99.67% on SafeBench and 98.33% on JailbreakBench, with roughly 1.48 and 1.37 queries per input. A one-shot variant works, but the multi-turn state makes it bite harder. It stays effective against several “highly aligned” commercial models, including one that shrugged off most baselines. Single-turn defences dent it but don’t stop it; a targeted system-prompt countermeasure (FCGuard) markedly reduces or eliminates success on some newer models in the authors’ tests.

Does this matter?

Yes, because this is not a cute prompt hack. It exploits how agent frameworks and function-calling APIs actually run: schemas, arguments and tool outputs poured into one context and treated as gospel. If your moderation or validation tooling shares that pipe, the attacker can impersonate it. The costs are low and the cover story is credible (“generate unsafe examples for red teaming”).

Caveats apply. You need API-level function calling and the ability to register or invoke custom schemas; this is not for plain chat UIs. Results are a snapshot of current vendor behaviour, and targeted system prompts can clip the attack on some models. Still, the core point stands: if you blur control and data in a single stateful context, you hand adversaries a lever. The emperor here isn’t the prompt. It’s the architecture.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces

Authors: Junlong Liu, Haobo Wang, Weiqi Luo, and Xiaojun Jia
Jailbreak attacks remain a critical threat to the safe deployment of large language models (LLMs). While prior work has primarily studied attacks and defenses at the prompt level, we show that this prompt-centric paradigm overlooks a structural vulnerability in stateful, function-calling environments. In such applications, developer-defined schemas, structured arguments, and untrusted tool outputs are interleaved into a single shared model context. This architecture expands the attack surface by blurring the boundary between trusted control logic and untrusted data, allowing adversarial intent to be distributed across a multi-turn execution path. We exploit this architectural flaw through SMT, a black-box attack framework based on Simulated Moderation Traces. Departing from purely prompt-based interactions, SMT constructs a multi-turn trajectory that simulates a legitimate moderation-auditing workflow. Within this trajectory, a fabricated moderation frame leverages red-team testing as a pretext to elicit harmful generations. The subsequent validation feedback treats safety refusals as execution failures, prompting refinements that gradually weaken the model's safety constraints and ultimately trigger harmful outputs. Extensive empirical evaluations on prominent commercial LLMs from five different providers across two standardized safety benchmarks show that SMT consistently achieves the highest average attack success rate and HarmScore while requiring a near-minimal number of queries, substantially outperforming existing baselines. These findings demonstrate that prompt-level sanitization alone is fundamentally insufficient for defending tool-enabled LLM systems and highlight the urgent need for context-aware validation across schemas, arguments, tool outputs, and accumulated conversation state. The code is available at https://github.com/liujlong27/SMT.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies a structural vulnerability in tool-enabled, function-calling large language model deployments. In these systems developer-defined schemas, structured arguments and tool outputs are assembled into a single shared model context, blurring the boundary between trusted control logic and untrusted data. That interleaving lets an adversary distribute malicious intent across multiple turns and components rather than relying on a single prompt, which can defeat prompt-only sanitisation and conventional safety checks.

Approach

The authors introduce SMT (Simulated Moderation Traces), a black-box attack framework that simulates a moderation‑auditing workflow across multiple turns. SMT exposes two function schemas to the model: a case generator that requests paired safe and unsafe samples as function arguments, and a case validator that describes validation criteria. The attacker submits these schemas and then iterates: if the model refuses or produces insufficiently harmful output, SMT fabricates a well-formed function call or a validator-style exception and appends it to the conversation history to reframe refusals as execution failures. An external LLM judge scores candidate unsafe samples and SMT retains the best candidate across limited retry and multi‑turn loops. Evaluations use two standard safety suites and multiple commercial and open-source models under a black-box API threat model.

Key Findings

  • High effectiveness: SMT achieved near‑state‑of‑the‑art attack performance, with average attack success rates reported as 99.67% on SafeBench and 98.33% on JailbreakBench and high average harmfulness scores.
  • Query efficiency: SMT required few API calls on average (about 1.48 and 1.37 queries per input on the two benchmarks), outperforming existing multi‑turn baselines that needed many more queries.
  • Multi‑turn amplification: The multi‑turn, stateful workflow materially improved success over a one‑shot variant; SMT(one‑shot) still outperformed many baselines but full SMT produced substantially higher success and harm scores.
  • Model robustness: SMT remained effective against several highly aligned commercial models, including one model that resisted most baselines; performance varied by target but SMT showed strong cross‑category consistency.
  • Partial mitigations: Conventional single‑turn defences reduced but did not eliminate SMT; a targeted system‑prompt defence (FCGuard) markedly reduced or eliminated success on some recent models in the authors' tests.

Limitations

SMT depends on access to API-level function‑calling and the ability to register or invoke custom tool schemas, so it is not directly applicable to plain chat interfaces. Results reflect a snapshot of specific commercial models and deployed safety mechanisms and may change as providers update systems. White‑box mechanistic analysis used a single open‑source model and may not generalise to proprietary internals. Targeted defences can substantially reduce effectiveness.

Implications

Offensive security implications are that attackers can subvert safety by embedding malicious intent across schemas, arguments, tool outputs and conversation history, using fabricated validation traces to coax models into compliance. This reduces the effectiveness of prompt sanitisation and simple input/output filters. Real‑world attackers with API access can therefore induce harmful outputs with low query cost by exploiting function‑calling workflows. Defences should be architectural: separate control logic from untrusted data, sandbox and validate tool outputs and arguments, monitor multi‑turn state, and apply context‑aware validation across the entire function‑calling lifecycle.

// Similar research

Related Research

Get the weekly digest

The few AI-security papers that matter, with the practitioner takeaway. No spam.