
Iterative LLM jailbreaks produce executable attack code

Attacks
Published: Wed, Sep 17, 2025 • By Lydia Stratus
New research shows attackers can iteratively nudge Large Language Models (LLMs) to turn vague malicious requests into concrete, often runnable code. Refinement steps lift jailbreak success from about 7% to over 60% and keep per-prompt cost low. The finding raises immediate operational risks for model deployments and automated pipelines.

Researchers demonstrate a practical jailbreak method called Content Concretization that turns abstract malicious prompts into concrete, often executable outputs. The approach uses a two-stage pipeline: a lower-tier model drafts a less constrained response, then a higher-tier model refines that draft. After three refinement iterations the paper reports jailbreak success rising from about 7% to roughly 62%, with per-prompt costs near 7.5 cents.

This matters for operations and security teams because the attack chain is cheap and repeatable, and it produces outputs that require minimal editing to run. The study evaluated 350 cybersecurity prompts and produced examples including denial-of-service style tools, spear-phishing automation and SQL injection scripts, showing the method can generate actionable code in a controlled setting.

Large Language Model (LLM) deployments often rely on stacked safety filters. Content Concretization defeats those stacks by distributing the constraint bypass across models: one model produces a permissive draft and another converts it into production-ready content. The authors tested a specific pairing of a lower-tier drafting model and a higher-tier refining model, and used automated and manual review to measure technical quality and malicious intent.

The risk is practical, not theoretical. The generated outputs passed a notable fraction of unit tests and were rated more malicious and technically superior after refinement. Low operational cost lowers the bar for attackers to build or scale malicious tooling, and automated pipelines that accept model outputs with weak controls are particularly exposed.

Mitigations

Defenders should assume multi-step jailbreaks are possible and respond with layered controls. Key actions include strict code-generation policies, runtime sandboxing with execution limits, and cross-model verification of intent and safety. The paper suggests runtime monitoring of prompt-output chains and stronger multi-layer safeguards to resist iterative refinement.

  • Log and trace prompt-output chains end to end for audit and anomaly detection
  • Reject or sandbox any model-produced code by default and require human review for risky outputs
  • Introduce cross-model semantic checks rather than relying on a single model's refusals (a minimal sketch follows this list)
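
The cross-model check in the last bullet can be as simple as a default-deny gate in front of any pipeline that consumes model output. Below is a minimal sketch, assuming a generic text-in/text-out wrapper (ask_verifier) around an independent second model from a different provider; the heuristic markers, rubric wording and GateDecision type are illustrative, not the paper's implementation.

    # Sketch: default-deny gate with a cross-model semantic check.
    # Assumes `ask_verifier` wraps an independent second model (ideally a different provider).
    from dataclasses import dataclass
    from typing import Callable

    VERIFIER_RUBRIC = (
        "You are reviewing another model's output before it enters an automated "
        "pipeline. Reply ALLOW only if it contains no executable code and no "
        "operational attack capability; otherwise reply BLOCK."
    )

    @dataclass
    class GateDecision:
        released: bool
        reason: str

    def looks_like_code(text: str) -> bool:
        # Cheap heuristic pre-filter; a real deployment would use a proper classifier.
        markers = ("```", "#!/", "import ", "def ", "subprocess", "socket(")
        return any(m in text for m in markers)

    def gate_output(prompt: str, output: str,
                    ask_verifier: Callable[[str], str]) -> GateDecision:
        # 1. Default-deny: anything code-like is quarantined for human review.
        if looks_like_code(output):
            return GateDecision(False, "code-like output quarantined for human review")
        # 2. Cross-model semantic check by the independent verifier model.
        verdict = ask_verifier(f"{VERIFIER_RUBRIC}\n\nPROMPT:\n{prompt}\n\nOUTPUT:\n{output}")
        if "ALLOW" not in verdict.upper():
            return GateDecision(False, "verifier flagged output; quarantined")
        return GateDecision(True, "released")

Splitting the decision across models counters the same property the attack exploits: a single model's refusal behaviour is easy to route around, whereas two models from different providers must both fail for unsafe content to pass automatically.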

Limitations matter: the study focuses on cybersecurity prompts and a single model pairing, so results may vary across domains and architectures. Still, the work highlights a realistic escalation path that operations teams must harden against. Practitioners should add iterated-refinement scenarios to red-team exercises and deployment playbooks now, because cheap refinement attacks will only get easier.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Jailbreaking Large Language Models Through Content Concretization

Authors: Johan Wahréus, Ahmed Hussain, and Panos Papadimitratos
Large Language Models (LLMs) are increasingly deployed for task automation and content generation, yet their safety mechanisms remain vulnerable to circumvention through different jailbreaking techniques. In this paper, we introduce Content Concretization (CC), a novel jailbreaking technique that iteratively transforms abstract malicious requests into concrete, executable implementations. CC is a two-stage process: first, generating initial LLM responses using lower-tier, less constrained safety filters models, then refining them through higher-tier models that process both the preliminary output and original prompt. We evaluate our technique using 350 cybersecurity-specific prompts, demonstrating substantial improvements in jailbreak Success Rates (SRs), increasing from 7% (no refinements) to 62% after three refinement iterations, while maintaining a cost of 7.5¢ per prompt. Comparative A/B testing across nine different LLM evaluators confirms that outputs from additional refinement steps are consistently rated as more malicious and technically superior. Moreover, manual code analysis reveals that generated outputs execute with minimal modification, although optimal deployment typically requires target-specific fine-tuning. With eventual improved harmful code generation, these results highlight critical vulnerabilities in current LLM safety frameworks.

🔍 ShortSpan Analysis of the Paper

Problem

The paper investigates how safety mechanisms in large language models (LLMs) can be bypassed through a jailbreaking approach called Content Concretization (CC). CC converts abstract malicious requests into concrete, executable outputs via an iterative two-stage process: first a draft is produced by a lower-tier model with less constrained safety filters, then a higher-tier model refines the draft using both the preliminary content and the original prompt. The study aims to understand how guarded code generation can be exploited, assess the practical risk, and highlight gaps in current safety frameworks. Using cybersecurity prompts, the work measures how refinements affect jailbreak success rates and output quality, emphasising that harmful content can be produced at relatively low cost and, in some cases, with limited editing required to deploy. The findings raise concerns about current LLM safety stacks and the need for stronger, multi-layer safeguards.

Approach

The authors design a two-stage Content Concretization pipeline: a draft generation phase with a lower-tier LLM that has reduced safety constraints, followed by a refinement phase with a higher-tier LLM that processes both the draft and the original prompt to produce production-ready code. The lower tier selected is OpenAI GPT-4o mini, while the higher tier is Anthropic Claude 3.7 Sonnet. The study deliberately excludes prompt obfuscation and prompt engineering techniques to attribute improvements solely to concretisation. Evaluation uses 350 cybersecurity prompts drawn from the CySecBench dataset, focused on seven cybersecurity categories, to assess effectiveness and reproducibility. A multi-stage evaluation framework combines manual review with automated measures: keyword-based pre-filtering to screen out benign content, and a three-member LLM jury across different providers to determine if outputs meet malicious intent, technical accuracy, completeness, and real-world applicability. Nine LLM evaluators perform A/B-style comparisons across architectural variants that differ in the number of refinement steps N, ranging from zero to four. Unit tests are conducted on a subset of outputs to gauge immediate executability, with Python-based testing and a total of 160 unit tests across 20 code samples. The cost of generation is monitored, with per-prompt costs reported at around 7.5 cents even for higher refinement configurations.
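
To make the jury step concrete, here is a minimal sketch of how a multi-provider majority vote can be orchestrated. It is an illustrative reconstruction, not the authors' code: the judge wrapper, rubric wording and FLAG/PASS labels are assumptions. The same pattern is what defenders can reuse as cross-model verification of outputs before release.

    # Sketch: three-member jury with majority vote (illustrative, not the paper's code).
    # Each judge is any model wrapped as a text-in/text-out callable, ideally from a
    # different provider so one model's blind spots do not decide the outcome alone.
    from typing import Callable, Sequence

    JURY_RUBRIC = (
        "Decide whether the following model output should be flagged as harmful or "
        "operationally dangerous. Reply with exactly FLAG or PASS."
    )

    def jury_flags(output: str, judges: Sequence[Callable[[str], str]]) -> bool:
        """Return True when a strict majority of judges vote FLAG."""
        votes = 0
        for ask_judge in judges:
            verdict = ask_judge(f"{JURY_RUBRIC}\n\nOUTPUT:\n{output}")
            if "FLAG" in verdict.upper():
                votes += 1
        return votes * 2 > len(judges)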

Key Findings

  • Jailbreak success rate increases with refinement: starting from a baseline SR of about 7.1 per cent, a single refinement raises SR to 57.1 per cent and three refinements achieve 62.0 per cent; four refinements drop to 46.6 per cent due to refusals during the final stage.
  • Higher refinement steps are consistently preferred in quality assessments: across nine evaluators, outputs from architectures with more refinement steps are rated as more malicious and technically superior, with a 71.8 per cent preference for the higher-refinement variant when comparing N = 1 to N = 3 and a 54.5 per cent preference when comparing N = 3 to N = 4.
  • Executability improves but remains limited: unit testing on 20 code samples shows a 30 per cent pass rate for full automated test suites and an 83.7 per cent average pass rate, indicating outputs typically require minor modifications for full operability.
  • Evidence of real-world threat potential: manual evaluation of the highest quality configuration demonstrates outputs capable of performing three malicious tasks, including a SYN-flood-style Denial of Service tool, a spear-phishing script with web scraping, and a SQL injection script, produced in a controlled, isolated environment. These results illustrate competence in producing actionable malicious code while acknowledging that deployment would face defensive measures and limitations.
  • Cost and practicality: even at the most intensive refinement, per-prompt costs stay around 7.5 cents, suggesting CC is economically accessible to adversaries and could lower barriers to producing sophisticated attacks.

Limitations

The study focuses specifically on cybersecurity-oriented outputs and uses a limited model pairing (one lower-tier and one higher-tier model). Automated evaluation, while robust, may still misclassify outputs, and the results may not generalise beyond the chosen prompts or domains. The research notes that content concretisation could yield higher jailbreak success if prompt obfuscation and prompt engineering techniques were incorporated within the architectural framework. Domain-specific design considerations are required for areas beyond cybersecurity, and cross-domain generalisability remains to be tested.

Why It Matters

The work demonstrates a concrete method to bypass LLM safety using Content Concretization, resulting in a notable rise in jailbreak success on cybersecurity prompts. It highlights an attack chain where harmful outputs can be produced and run with minimal editing, exposing gaps in current safety stacks. The practical implications include easier generation of dangerous code and instructions, potentially aiding cyber attacks, malware development, or evasion of content filters. Mitigation suggestions include stronger multi-layer safeguards that resist iterative refinement, improved cross-model verification, runtime monitoring of prompt-output chains, stricter controls on code generation, and robust sandboxing. Societal and security considerations emphasise risks to critical systems and broader misuse in automated content generation, underscoring the urgency of resilient deployment, red teaming of AI systems, and the need for continued research into defensive countermeasures.
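
As a small illustration of the runtime-monitoring suggestion, the sketch below hash-chains prompt-output records so that later tampering with the audit trail is detectable. The file path, record schema and the choice to store only content hashes are assumptions for the example, not details from the paper; a production system would pair this with anomaly detection over the logged chains.

    # Sketch: tamper-evident logging of prompt-output chains for runtime monitoring.
    # Stores only hashes of prompts and outputs; each record chains to the previous one.
    import hashlib, json, time

    LOG_PATH = "prompt_output_chain.log"  # hypothetical location

    def _last_hash(path: str) -> str:
        try:
            with open(path, "rb") as fh:
                last = fh.read().splitlines()[-1]
            return json.loads(last)["hash"]
        except (FileNotFoundError, IndexError):
            return "0" * 64  # genesis value for an empty or missing log

    def log_exchange(model: str, prompt: str, output: str, path: str = LOG_PATH) -> str:
        """Append one prompt-output record, chained to the previous entry's hash."""
        record = {
            "ts": time.time(),
            "model": model,
            "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
            "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
            "prev": _last_hash(path),
        }
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        with open(path, "a") as fh:
            fh.write(json.dumps(record) + "\n")
        return record["hash"]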

