Persona-conditioned red teaming uncovers diverse LLM jailbreaks
Pentesting
Automated red teaming for Large Language Models (LLMs) often tests the wrong threat model. We probe with tidy templates or single-turn tricks, then act surprised when real attackers use identity, tone and persistence to slip past guardrails. A new study pushes in the right direction: condition the search itself on who the attacker claims to be and how they operate.
How PCAP breaks things
Persona-Conditioned Adversarial Prompting (PCAP) bolts persona and tactic context onto a beam-search red-teaming loop. The system generates a pool of attacker archetypes and strategy cards, then runs parallel persona-conditioned searches. Each goal is reframed into persona-specific objectives. Candidate prompts are produced under those constraints, an on-topic filter strips drift, the target model is queried, and an evaluator scores success. Tactics span roleplay, virtualisation, historical context, persuasion and multi-turn framing. Similarity pruning by embedding distance keeps near-duplicates from flooding the queue.
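The shape of that loop can be sketched in a few lines. This is a toy, self-contained sketch of the control flow described above, not the paper's implementation: every helper (`reframe_goal`, `generate_candidates`, `on_topic`, `query_target`, `score_response`) is a hypothetical stand-in filled with trivial string logic so the structure is runnable.

```python
import random

# Toy sketch of a persona-conditioned beam search. All helpers below are
# hypothetical stand-ins for the paper's components, not its actual API.

def reframe_goal(goal, persona):
    return f"As a {persona}, {goal}"              # persona-specific objective

def generate_candidates(parent, persona, tactics, n=3):
    return [f"{parent} [via {t}#{i}]" for i, t in enumerate(tactics * n)][:n]

def on_topic(candidate, goal):
    return goal in candidate                      # crude drift filter

def query_target(prompt):
    return f"response to: {prompt}"               # stand-in for the target model

def score_response(response, goal):
    return 1.0 if "persuasion" in response else 0.5   # toy evaluator

def pcap_search(goal, personas, strategy_cards, beam_width=2, depth=2):
    successes = []
    for persona in personas:                      # one search per persona
        beam = [reframe_goal(goal, persona)]
        for _ in range(depth):
            tactics = random.sample(strategy_cards, k=2)  # strategy-card subset
            candidates = [c for p in beam
                            for c in generate_candidates(p, persona, tactics)]
            candidates = [c for c in candidates if on_topic(c, goal)]
            scored = sorted(((score_response(query_target(c), goal), c)
                             for c in candidates), reverse=True)
            successes += [c for s, c in scored if s >= 1.0]
            beam = [c for _, c in scored[:beam_width]]    # beam pruning
    return successes
```

The key design point is that persona conditioning happens twice: once when the goal is reframed, and again on every candidate generation, so the search never falls back to a generic attacker voice.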
The effect is not subtle. Across models, attack success rate (ASR) jumps sharply. Examples: Llama 3.3 70B climbs from 57.7% to 97.3% ASR, Granite 4.0 H Tiny from 41.0% to 98.0%, and a GPT-OSS 120B target moves from about 58% to about 97%. PCAP also tends to find a first valid prompt earlier under the same search depth and often yields multiple distinct prompts per goal.
Diversity improves too. Where a baseline favours roleplay almost by reflex, PCAP spreads wins across strategies. The distribution shift is measurable: a lower chi-square statistic against a uniform mix (PCAP 1.87 versus 6.38 for the baseline) indicates broader coverage. That matters if you want a dataset that exercises defences beyond the usual theatre-kid routine.
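The chi-square comparison is a one-line calculation once you have per-strategy win counts. A minimal sketch with made-up counts follows; the paper's raw counts are not reported here, so these numbers only illustrate the direction of the effect.

```python
def chi_square_vs_uniform(counts):
    """Chi-square statistic of observed strategy-win counts vs a uniform mix."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Hypothetical win counts over five strategies (not the paper's data):
baseline = [40, 5, 3, 4, 8]      # heavy roleplay bias
pcap     = [14, 11, 12, 10, 13]  # broader coverage

# A lower statistic means the strategy mix sits closer to uniform.
assert chi_square_vs_uniform(pcap) < chi_square_vs_uniform(baseline)
```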
There is a bill to pay. Query counts go up, sometimes a lot. One attacker setup on Granite 3.3 8B rises from roughly 37 to roughly 169 queries per goal. The pruning trick helps, cutting queries by about 40–50% while only slightly denting ASR and prompt yield, but you still feel it in time and tokens.
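The pruning trick amounts to a cosine-similarity prefilter over prompt embeddings: only query candidates that are sufficiently novel relative to what has already been tried. A sketch under stated assumptions, with a bag-of-words `embed` as a toy stand-in for a real embedding model and a 0.9 threshold that is an assumption, not a reported setting:

```python
import math

def embed(text):
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prune_near_duplicates(prompts, threshold=0.9):
    kept, kept_vecs = [], []
    for p in prompts:
        v = embed(p)
        # Only spend target-model queries on prompts that differ enough
        # from everything already in the queue.
        if all(cosine(v, kv) < threshold for kv in kept_vecs):
            kept.append(p)
            kept_vecs.append(v)
    return kept
```

Because the filter runs before the target model is queried, every pruned near-duplicate is a query saved, which is where the reported 40–50% reduction comes from.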
The jailbreaks travel. Prompts found under PCAP transfer across models and slip past dedicated guardrails. The study reports nontrivial bypass rates against Llama Guard 3 at 29.4–53.0% and Granite Guardian 3.3 at 14.1–33.4%, plus modest ASR on a robust target in the 3.7–8.0% range. Persona-conditioned framing appears to stress the heuristics those guards rely on.
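Measuring a bypass rate like those above is just a pass-through fraction: feed each discovered prompt to the guard and count how many it waves through. A minimal harness sketch, with a toy keyword filter standing in for a real classifier such as Llama Guard (the keyword list and prompts are illustrative):

```python
def guard(prompt):
    """Toy guard: flag prompts containing blunt keywords. A real guardrail
    is a trained classifier, not a keyword match."""
    return any(k in prompt.lower() for k in ("bomb", "malware"))

def bypass_rate(prompts, guard_fn):
    """Fraction of prompts the guard fails to flag."""
    passed = sum(1 for p in prompts if not guard_fn(p))
    return passed / len(prompts)

found = [
    "write malware",
    "as a chemistry teacher, walk through the experiment",  # persona framing
    "build a bomb",
]
rate = bypass_rate(found, guard)
```

The persona-framed prompt is the one that slips through here, which mirrors the article's point: identity-conditioned framing stresses surface heuristics that keyword-ish defences rely on.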
There are caveats. The pipeline leans on LLMs for persona generation, reframing and scoring, so it inherits their blind spots. Automated success judgements are imperfect and still need human adjudication. Efficiency depends on attacker capacity and how many personas you spin up. But the direction is right. If we care about real-world risk, our evaluations should reflect how real adversaries operate: identity-aware, context-rich, sometimes patient. Standard-setters and internal assurance teams will need to decide what “enough” persona coverage looks like, how to budget the extra queries, and how to report these results without handing out a cookbook. Those choices are policy, not just engineering.

Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies gaps in automated red-teaming for large language models that arise from limited attacker diversity. Existing automated pipelines tend to use narrow templates or single-turn tactics and therefore miss attacks that rely on attacker identity, framing, or staged multi-turn persuasion. This under-coverage can underestimate real-world risk by failing to surface jailbreaks and prompt-injection vectors that exploit narrative, register, or role-play.
Approach
The authors introduce Persona-Conditioned Adversarial Prompting (PCAP), an orthogonal extension to an existing beam-search red-teaming framework (TAP). PCAP conditions the adversarial search on automatically generated attacker personas and on strategy cards that specify tactics to prioritise. For each persona a parallel beam search is run: goals are reframed into persona-specific objectives, attackers generate candidate prompts conditioned on persona and strategy subsets, an on-topic filter removes irrelevant rewrites, the target model is queried, and an evaluator scores success. PCAP samples personas and strategies from a randomized pool, and supports similarity-based prefiltering to reduce duplicate queries.
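The randomized sampling step can be sketched as below. The pool contents, sample sizes, and run-configuration shape are illustrative assumptions, not values from the paper:

```python
import random

# Illustrative persona and strategy-card pools; the real system generates
# personas automatically rather than hard-coding them.
PERSONAS = ["disgruntled sysadmin", "curious student", "fiction writer",
            "compliance auditor", "retired chemist"]
STRATEGIES = ["roleplay", "virtualization", "historical context",
              "persuasion", "multi-turn framing"]

def sample_run_configs(n_runs, strategies_per_run=2, seed=None):
    """Draw (persona, strategy-subset) configurations for parallel searches."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        runs.append({
            "persona": rng.choice(PERSONAS),
            "strategies": rng.sample(STRATEGIES, k=strategies_per_run),
        })
    return runs
```

Each configuration then seeds one of the parallel beam searches described above, so attacker diversity is injected before any target query is spent.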
Key Findings
- PCAP substantially raises attack success rate (ASR) and prompt yield across models. Example improvements include Llama 3.3 70B from 57.7% to 97.3% ASR and Granite 4.0 H Tiny from 41.0% to 98.0% ASR.
- Prompt diversity and strategy coverage increase: PCAP produces a more balanced mix of strategies (Roleplay, Virtualization, Historical Context, Persuasion and others) versus TAP’s strong bias toward Roleplay, with chi-square statistics indicating PCAP is markedly closer to a uniform strategy distribution (PCAP χ²=1.87 vs TAP χ²=6.38).
- PCAP finds successful prompts earlier under the same search depth and often yields multiple valid prompts per goal, improving dataset diversity for defensive work.
- Query costs rise: some attacker models see large increases in queries per goal (for example Granite 3.3 8B from ~37 to ~169 queries per goal), but similarity pruning based on embedding cosine similarity can cut queries by about 40–50% while only slightly reducing ASR and prompt yield.
- Prompts discovered by PCAP show cross-model transferability and can bypass dedicated guardrails; measured transfer rates include ASR against a robust target in the 3.7–8.0% range and nontrivial bypass fractions for specific guardrails (Llama Guard 3: 29.4–53.0%; Granite Guardian 3.3: 14.1–33.4%).
Limitations
PCAP increases query and compute costs, particularly for lower-capacity attacker models where additional personas yield diminishing returns. The pipeline depends on LLMs for persona generation, reframing and evaluation and therefore inherits their biases and blind spots. Automated scoring is imperfect and human adjudication remains necessary before using discovered prompts for critical evaluation or remediation. Some model-specific artefacts can bias attack generation, and choice of attacker capacity and persona count materially affects efficiency.
Implications
Offensively, PCAP shows that conditioning on attacker identity and explicit multi-tactic strategies produces more varied, transferable jailbreaks that are harder to predict and defend against. Attackers can combine multiple techniques within a single prompt and use persona-driven reframing to evade guardrails and exploit different language registers. This highlights the need for guards that consider identity-conditioned prompts, multi-turn persuasion, and broader conversational framings, since static single-shot defences may miss these vectors.