Agent Weaves Harmless Queries to Bypass Guardrails
Agents
A Large Language Model (LLM) is a statistical model trained on large amounts of text to predict and generate language. An agent, in this context, is an automated workflow that issues queries to a model and acts on the responses.
The paper introduces the Correlated Knowledge Attack Agent (CKA-Agent), which exposes a simple but serious blind spot in current safety stacks. Instead of trying to sneak a single malicious prompt past a filter, CKA-Agent breaks the harmful goal into many harmless-looking subqueries, uses the model's replies to guide further queries, and then assembles the pieces into the forbidden answer. The attack treats the target model as an oracle of correlated knowledge rather than as a single prompt target.
What the attack does
CKA-Agent frames jailbreaking as an adaptive, tree-structured search through the model's internal knowledge. It issues locally innocuous queries, scores responses with a hybrid evaluator that mixes introspection and target feedback, and expands promising branches using an Upper Confidence Bound for Trees (UCT) style selector. A synthesiser assembles fragments from a successful branch and an online judge checks whether the harmful objective is achieved. The authors report that the approach works across commercial models including Gemini2.5-Flash/Pro, GPT-oss-120B and Claude-Haiku-4.5, with consistently high success: over 95 per cent overall in their evaluations, and 96 to 98 per cent full success on HarmBench and StrongREJECT.
Operationally the attack is efficient. The agent uses a depth-limited search (depth five), a small adaptive branching factor (one to three), and at most five iterations. That keeps API calls and token usage lower than many prompt-optimisation baselines, making the method practical for automated red teaming or abuse at scale.
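To make the search machinery concrete, here is a minimal sketch of UCT-style leaf selection over a query tree, using the reported depth limit of five. The node fields, the exploration constant and the function names are illustrative assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field

# Depth limit taken from the reported setup; the exploration constant is an assumption.
MAX_DEPTH = 5
EXPLORATION_C = 1.4

@dataclass
class QueryNode:
    depth: int = 0
    score: float = 0.0          # cumulative hybrid-evaluator score (introspection + target feedback)
    visits: int = 0
    children: list["QueryNode"] = field(default_factory=list)

def uct_value(child: QueryNode, parent_visits: int) -> float:
    """Standard UCT: balance exploiting high-scoring branches with exploring rarely visited ones."""
    if child.visits == 0:
        return float("inf")     # unvisited children are tried first
    exploit = child.score / child.visits
    explore = EXPLORATION_C * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_leaf(root: QueryNode) -> QueryNode:
    """Walk down the tree by UCT until a leaf or the depth limit is reached."""
    node = root
    while node.children and node.depth < MAX_DEPTH:
        node = max(node.children, key=lambda c: uct_value(c, max(node.visits, 1)))
    return node
```

In the paper's loop the selected leaf would then be expanded with one to three follow-up queries, rescored by the hybrid evaluator, and the scores backed up the path; the sketch shows only the generic selection rule.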
Defence implications and limits
This work shifts the defender's problem from single-shot filtering to tracking cross-query intent and knowledge decomposition. Surface-level detectors and static prompt defences may still catch some attacks, but they struggle when malicious intent is distributed across many benign queries. The paper suggests several defensive directions: monitor evolving query graphs for correlated information pulls, apply context-aware guardrails that infer latent intent across turns, harden access controls and internal representations, and employ defence in depth with human review on high-risk query patterns.
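As a rough illustration of context-aware monitoring, the sketch below accumulates a per-topic risk signal across a whole session rather than judging each prompt in isolation. The `topic_risk` scorer, the threshold and the decay factor are hypothetical placeholders; a real deployment would substitute its own classifiers and tuning.

```python
from collections import defaultdict

def topic_risk(query: str) -> tuple[str, float]:
    # Placeholder per-query scorer returning (topic label, risk in [0, 1]).
    # In practice this would be a classifier or policy model, not a keyword list.
    sensitive = {"synthesis": 0.6, "exploit": 0.7, "bypass": 0.5}
    for keyword, risk in sensitive.items():
        if keyword in query.lower():
            return keyword, risk
    return "benign", 0.05

class SessionIntentMonitor:
    """Tracks correlated information pulls across a session, not single prompts."""

    def __init__(self, alert_threshold: float = 1.5, decay: float = 0.9):
        self.alert_threshold = alert_threshold  # assumed tuning values
        self.decay = decay
        self.topic_scores: dict[str, float] = defaultdict(float)

    def observe(self, query: str) -> bool:
        """Return True when accumulated risk on any topic crosses the threshold."""
        topic, risk = topic_risk(query)
        # Decay old signal so long benign sessions do not trip the alert.
        for t in self.topic_scores:
            self.topic_scores[t] *= self.decay
        self.topic_scores[topic] += risk
        return self.topic_scores[topic] >= self.alert_threshold
```

The shape matters more than the numbers: risk accumulates per topic and across turns, so a chain of individually innocuous queries can still raise an alert.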
The authors are careful to note limitations. Their evaluation relies on automated judges and capable attack agents, and some harmful outputs may not be reconstructible from correlated benign facts, so the method has bounds. They also call for more human-in-the-loop evaluation and for benchmarks that exclude tasks the attacker model could solve on its own.
Practically, organisations should assume single-query filters are insufficient. Start by logging and analysing multi-turn query patterns, flagging rapid information aggregation on sensitive topics, and routing suspicious chains for human review. Those are not foolproof, but they are the sensible first line of defence until systems can reason about cross-query intent the way an attacker now does.
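A minimal operational sketch of that first line of defence might look like the following: log every turn, flag sessions that aggregate several sensitive-topic queries within a short window, and hand the flagged chain to a human reviewer. The window size, the threshold and the `is_sensitive_topic` check are illustrative assumptions rather than a vetted policy.

```python
import time
from collections import deque

WINDOW_SECONDS = 300   # assumed aggregation window (5 minutes)
FLAG_THRESHOLD = 3     # assumed number of sensitive queries before escalation

def is_sensitive_topic(query: str) -> bool:
    # Placeholder for a real topic classifier.
    return any(term in query.lower() for term in ("weapon", "malware", "toxin"))

class QueryChainLogger:
    """Per-session log that routes suspicious multi-turn query chains for human review."""

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.chain: list[tuple[float, str]] = []
        self.recent_sensitive: deque = deque()

    def record(self, query: str) -> None:
        now = time.time()
        self.chain.append((now, query))
        if is_sensitive_topic(query):
            self.recent_sensitive.append(now)
        # Keep only sensitive hits inside the aggregation window.
        while self.recent_sensitive and now - self.recent_sensitive[0] > WINDOW_SECONDS:
            self.recent_sensitive.popleft()
        if len(self.recent_sensitive) >= FLAG_THRESHOLD:
            self.route_for_review()

    def route_for_review(self) -> None:
        # Placeholder escalation: in practice, attach the full chain to a review ticket.
        print(f"Session {self.session_id}: flagging {len(self.chain)} queries for human review")
```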
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
A Wolf in Sheep's Clothing: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search
🔍 ShortSpan Analysis of the Paper
Problem
Large language models remain vulnerable to jailbreaks that bypass safety guardrails to produce harmful content. Existing approaches largely focus on prompt optimisation and single-shot prompts, which guardrails can often detect. The paper identifies a deeper vulnerability rooted in the interconnected structure of an LLM's internal knowledge, which allows harmful objectives to be realised by weaving together sequences of benign sub-queries that individually evade detection. This knowledge-decomposition attack threatens even strong safety systems and highlights the need for robust defence in depth and monitoring across multi-step reasoning and cross-query dependencies.
Approach
The authors introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as adaptive, tree-structured exploration of the target model's knowledge base. The agent decomposes the harmful objective into benign, interconnected sub-queries and uses responses to guide exploration along multiple paths. It employs a simulation-free, depth-first-search-style strategy with a hybrid evaluator (introspection plus target feedback) to score nodes, and uses Upper Confidence Bound for Trees (UCT) to select promising leaves for expansion. A synthesiser aggregates fragments along a path to attempt a final harmful output, while an online judge evaluates success. The target model acts as a knowledge oracle, with its responses serving as internal knowledge; the attacker does not rely on domain priors. The process operates as a three-component loop: an Attack Agent that decomposes and branches, the Target Model that provides responses, and the Evaluator/Judge that drives termination. The method uses a maximum of five iterations and a search depth of five, with an adaptive branching factor from one up to three when uncertainty exists. The evaluation uses the HarmBench and StrongREJECT benchmarks and tests commercial LLMs including Gemini2.5-Flash/Pro, GPT-oss-120B and Claude-Haiku-4.5, with a Qwen3-32B-abliterated attacker model and Gemini2.5-Flash as the judge in the main results.
Key Findings
- CKA-Agent outperforms prompt-centric and static-decomposition baselines, achieving 96 to 98 per cent Full Success across HarmBench and StrongREJECT against strong guardrails.
- The advantage stems from adaptive decomposition guided by real-time responses and multiple correlated knowledge paths, enabling multi-hop synthesis that defeats single-turn detectors.
- Prompt-only and representation-level defences such as Llama Guard and Circuit Breaker reduce the effectiveness of some single-shot attacks but struggle against multi-turn decomposition attacks; Circuit Breaker is the most effective defence among those tested but does not fully prevent CKA-Agent.
- Performance improves with iterations: the first iteration yields 80 to 95 per cent success, subsequent iterations add incremental gains, and adaptive branching delivers robust improvements across up to five iterations, with most successes occurring in the first two.
- CKA-Agent demonstrates a strong cost-performance profile, with fewer API calls and lower token usage than baselines at similar success rates, indicating practical viability for red teaming at scale.
Limitations
The authors acknowledge reliance on automated judges, potential biases in human-like evaluation despite cross-validation, and the use of capable open-source attack agents. They note that the framework assumes harmful outputs can be reconstructed from correlated benign facts, which may not hold for atomic or highly compartmentalised knowledge. They call for future work on domain-blind benchmarks that exclude tasks the attacker model can solve on its own, human-in-the-loop LLM judging, and defence strategies that reason about multi-turn intent.
Why It Matters
The study shifts defence considerations away from single-prompt filtering towards monitoring knowledge decomposition and attack graphs across turns. Defences must account for cross-query dependencies and evolving attack graphs, harden internal representations and access controls, and employ context-aware guardrails capable of inferring latent harmful intent from sequences of innocuous queries. The work emphasises the societal risk of automated disinformation and illicit guidance at scale and advocates defence in depth and responsible disclosure to foster safer AI systems.