Intent Laundering Breaks Cue-Driven LLM Safety
Pentesting
AI safety scores look reassuring until you notice how often they test the wrong thing. A new paper argues that popular benchmarks reward models for spotting taboo words, not malicious intent. That is a neat way to pass tests and a poor way to survive contact with an adversary.
The authors interrogate two staples of safety evaluation, AdvBench and HarmBench. They start with the text itself: n‑gram plots and embedding similarity show heavy reuse and contrived phrasing. In AdvBench, over 45% of items are near‑identical at a 0.95 similarity threshold and more than 11% are almost exact copies at 0.99. In other words, a lot of prompts look the same and loudly shout their intentions.
Then comes the pivot. The team introduces 'intent laundering': rewrite a malicious prompt to remove overtly negative connotations while keeping the same goal and actionable detail. Two moves power it. Connotation neutralisation swaps loaded words for neutral descriptions. Context transposition maps real‑world targets to non‑real settings while keeping specifics transferable. A Large Language Model (LLM) performs the laundering via few‑shot prompting; a separate LLM judge scores resulting model outputs for safety and practicality, mapping five‑point ratings to binary decisions.
The numbers shift sharply once the cues go. On AdvBench, mean attack success jumps from 5.38% to 86.79% after a single laundered revision. On HarmBench it moves from 13.79% to 79.83%. The responses do not degrade into fluff either; practicality stays high, indicating the instructions remain usable in the real world. The automated judge aligns with human consensus 90% of the time on safety labels and 94% on practicality in sampled checks.
They also turn laundering into a jailbreak. An iterative revision–regeneration loop, run under fully black‑box access, pushes success rates to between 90% and 98.55% across all evaluated models. That includes models widely described as highly safe, such as Gemini 3 Pro and Claude Sonnet 3.7. The headline claim is blunt: once you strip the triggering cues, all previously assessed 'reasonably safe' models become unsafe.
The buried lede is that we have been measuring cue refusal, not intent resistance. If your red team toolkit leans on benchmarks that scream their own malice, you are likely overestimating your posture. Real attackers do not announce 'build a bomb'; they obfuscate. This work shows how easily a model will cooperate when the language is tidied but the goal is untouched.
There are caveats. The study focuses on two public datasets and a specific, automated laundering pipeline. Results depend on the chosen dataset subsets after deduplication and on the capabilities of current LLMs used to launder and to judge. The judge is good but not perfect, and the binary mapping may flatten nuance. None of that rescues cue‑based defences, but it does bound the claims.
What this means for pentesters
- Do not rely on trigger‑word prompts. Test with cue‑free rewrites that preserve intent and detail.
- Deduplicate and refresh benchmarks; near‑duplicates invite overfitting to phrasing.
- Score practicality and transferability, not just refusals; treat black‑box iterative rewriting as a standard test.
The open question is whether model alignment can track implied malice without collapsing into false positives. Until then, evaluations that depend on spotting the obvious will keep flattering to deceive, and adversaries will keep writing around your filters.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Intent Laundering: AI Safety Datasets Are Not What They Seem
🔍 ShortSpan Analysis of the Paper
Problem
This paper asks whether widely used AI safety datasets genuinely reflect real‑world adversarial attacks. It argues that contemporary benchmarks overuse overt "triggering cues"—words or phrases with explicit negative or sensitive connotations—that are likely to provoke model refusals but are unrepresentative of how real attackers craft prompts. If true, evaluations based on these datasets may overstate model safety and leave deployed systems exposed.
Approach
The authors study two popular safety datasets, AdvBench and HarmBench. First, they analyse the datasets in isolation using n‑gram frequency visualisations and pairwise similarity of embeddings to identify recurring triggering cues and duplication. Second, they introduce "intent laundering", a procedure that removes overt triggering language while preserving malicious intent and actionable detail. Intent laundering has two components: connotation neutralisation, which replaces triggering expressions with neutral or descriptive alternatives, and context transposition, which maps real‑world referents to non‑real contexts while keeping details transferable. The laundering is automated via a large language model in a few‑shot in‑context learning setup (the "intent launderer"), and model responses to original and laundered prompts are judged for safety and practicality by a separate LLM judge using five‑point Likert scales mapped to binary outcomes. They also add an iterative revision–regeneration loop to turn laundering into a jailbreak method, feeding failed revisions back to generate improved revisions under black‑box conditions.
Key Findings
- Safety datasets overrepresent triggering cues and near‑duplicates. In AdvBench over 45% of data points are near‑identical at a 0.95 similarity threshold and over 11% are almost exact copies at 0.99, indicating heavy duplication and contrived phrasing.
- Removing triggering cues sharply increases attack success. Mean attack success rates rise from 5.38% to 86.79% on AdvBench and from 13.79% to 79.83% on HarmBench after a single intent‑laundering revision, showing that model refusals largely track explicit cues rather than underlying intent.
- Intent laundering is an effective jailbreaking technique. With an iterative revision–regeneration loop under fully black‑box access, attack success rates reached 90% to 98.55% across all evaluated models, including models reported as highly safe such as Gemini 3 Pro and Claude Sonnet 3.7.
- Responses to laundered prompts remain practical and transferable. High practicality ratings indicate that abstracted prompts elicit detailed, real‑world applicable instructions despite neutralised language.
- Automated judging is reliable. The LLM judge agreed with human expert consensus on binary safety labels 90% of the time and on practicality labels 94% of the time in a sampled evaluation.
Limitations
The study focuses on two publicly available datasets and a specific automated laundering procedure using contemporary LLMs. Results depend on the dataset subsets selected after deduplication and on LLM capacities used for laundering and judging. The paper acknowledges potential impacts on public trust and reports that affected providers were informed.
Why It Matters
The paper exposes a disconnect between benchmarked safety and real‑world adversarial behaviour: defences tuned to detect overt triggering cues can be bypassed easily when malicious intent is implied rather than explicit. For AI security this implies the need for more realistic red‑teaming, cue‑free evaluation, and alignment methods that do not rely on brittle cue detection. The findings suggest current evaluations may dramatically understate risk in deployed systems.