Small LLM swarms jailbreak GPT-4o, find 9 CWEs
Policy keeps circling the same drain: the assumption that restricting frontier models reduces offensive capability. This paper tests that claim and lands a punch. It shows that the system scaffolding coordinating small models, not model size, is the enabler, and it backs that with an open-source framework called swarm-attack.
Swarm-attack runs multiple lightweight 1.2-billion-parameter Large Language Models (LLMs) as agents that share memory, search in parallel, and evolve strategies. In one sentence: the scaffold is glue code and simple tooling that lets small models act bigger through coordination.
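The coordination pattern described here, a shared memory plus agents that prefer strategies their peers already scored well, can be sketched roughly as below. All names are illustrative assumptions, not the swarm-attack API:

```python
import random

class SharedMemory:
    """Hypothetical blackboard: agents record strategies and their scores."""
    def __init__(self):
        self.successes = []

    def record(self, strategy, score):
        self.successes.append((strategy, score))

    def best(self, k=3):
        # Highest-scoring strategies seen by any agent so far.
        return sorted(self.successes, key=lambda s: -s[1])[:k]

def agent_step(agent_id, memory, candidate_strategies, evaluate):
    """One agent tries a strategy, reusing ones peers found promising."""
    pool = [s for s, _ in memory.best()] or candidate_strategies
    strategy = random.choice(pool)
    score = evaluate(strategy)
    memory.record(strategy, score)  # share the result with the swarm
    return strategy, score
```

The point of the sketch is that the "intelligence" lives in the loop, not the model: any agent can pick up where another left off because results flow through the shared memory.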
Jailbreak campaign
Five agent instances each launched 225 automated jailbreak attempts against two frontier services. Outcomes were scored by cross-provider LLM judges, then validated using an Effective Harm Rate metric that requires manual verification. Against GPT-4o, the swarm hit an Effective Harm Rate of 45.8% and produced 49 critical-severity breaches. Against Claude Sonnet-4, the Effective Harm Rate was 0% despite a 40% technical success rate by judge scoring. Evolutionary optimisation and shared memory mattered: success more than doubled against one target across 15 generations as prompts and strategies were mutated and reused across agents.
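The gap between judge-scored success and Effective Harm Rate can be made concrete. The sketch below assumes a simple definition, manually confirmed harms over total attempts; the paper's exact formula is not reproduced here:

```python
def technical_success_rate(judge_flags):
    """Fraction of attempts an LLM judge marked as successful."""
    return sum(judge_flags) / len(judge_flags)

def effective_harm_rate(judge_flags, verified_flags):
    """Fraction of attempts that survive manual verification.

    A judge-flagged attempt only counts if a human confirmed real harm,
    so EHR <= technical success rate by construction.
    """
    confirmed = [j and v for j, v in zip(judge_flags, verified_flags)]
    return sum(confirmed) / len(judge_flags)
```

Under this definition a 40% technical success rate with zero confirmed harms, as seen against Claude Sonnet-4, yields an EHR of exactly 0%.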
Code exploitation pipeline
The same swarm sat inside a classic static-plus-dynamic pipeline: regex pattern detection, role-specialised source review, a hand-crafted exploit seed corpus, post-hoc CWE normalisation, and AddressSanitizer-based crash classification. The target was a deliberately vulnerable C application with nine planted weaknesses across nine CWE classes. With the scaffold, the system recovered 9 of 9 vulnerabilities in roughly four minutes on a consumer MacBook. With those scaffold pieces disabled, performance collapsed to 0 of 9 by crash verification and 2 of 9 by citation. The lift came from the system around the model, not the tiny model’s raw reasoning.
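The regex pattern detection stage can be illustrated with a toy scanner. The signature-to-CWE mapping below is invented for illustration and is not the paper's rule set:

```python
import re

# Illustrative signatures only; the paper's actual detection rules are not
# reproduced here.
PATTERNS = {
    r"\bgets\s*\(": "CWE-242 (inherently dangerous function)",
    r"\bstrcpy\s*\(": "CWE-120 (unbounded buffer copy)",
    r"\bsystem\s*\(": "CWE-78 (possible OS command injection)",
}

def scan_source(code: str):
    """Return (line_number, cwe_label) hits for each matching pattern."""
    hits = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for pattern, label in PATTERNS.items():
            if re.search(pattern, line):
                hits.append((lineno, label))
    return hits
```

Cheap pattern hits like these give the agents starting points, which the dynamic half of the pipeline then tries to confirm with crashing inputs.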
One useful jab at current practice: LLM-as-judge scoring sprayed false positives across the critical-severity tail. The authors propose Effective Harm Rate as a more operational metric, but it needs manual verification and does not scale.
There are caveats. Results use a planted target, a single model version and hardware profile, and the assisted pipeline benefited from careful hand-crafted components. A prior measurement bug also shows how evals can leak ground truth. Still, the core point stands: the capability class that motivated restricted release of Anthropic’s Mythos Preview is reproducible at effectively zero cost when you publish the right scaffold. The open question is how this transfers to production code and live services. From a pentester’s seat, the real attack surface is the system design choices you make around memory, coordination, and evaluation loops.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Position: AI Security Policy Should Target Systems, Not Models
🔍 ShortSpan Analysis of the Paper
Problem
This paper examines whether AI security policy that restricts access to individual frontier models meaningfully reduces offensive capability. It argues the main threat is system-level scaffolding that coordinates multiple smaller models, and demonstrates that coordinated multi-agent frameworks can reproduce the capability class behind restricted-release decisions at effectively zero cost on commodity hardware.
Approach
The authors develop swarm-attack, an open-source adversarial testing framework in which multiple lightweight 1.2-billion-parameter language models run as distinct agents that share memory, explore in parallel, and evolve strategies through crossover and mutation. Two experiments are reported. Experiment 1 runs five agent instances that perform 225 automated jailbreak attacks each against two frontier services. Attack outcomes are scored by cross-provider LLM judges but validated with a manually verified Effective Harm Rate metric. Experiment 2 embeds the same swarm in a pipeline combining regex pattern detection, role-specialised source analysis, a hand-crafted exploit seed corpus, post-hoc CWE normalisation, and AddressSanitizer-instrumented binary fuzzing against a deliberately vulnerable C application containing nine planted vulnerabilities across nine CWE classes. An autonomous configuration disables the hand-crafted scaffold to isolate the model contribution.
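The crossover-and-mutation loop used in Experiment 1 can be sketched roughly as follows; the operators and the fitness interface are assumptions for illustration, not the framework's actual implementation:

```python
import random

def crossover(parent_a: str, parent_b: str) -> str:
    """Splice the first half of one prompt onto the second half of another."""
    cut_a, cut_b = len(parent_a) // 2, len(parent_b) // 2
    return parent_a[:cut_a] + parent_b[cut_b:]

def mutate(prompt: str, synonyms: dict) -> str:
    """Swap one known token for a synonym, if any token matches."""
    words = prompt.split()
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    if candidates:
        i = random.choice(candidates)
        words[i] = synonyms[words[i]]
    return " ".join(words)

def evolve(population, fitness, synonyms, generations=15):
    """Keep the fitter half, refill with mutated crossovers of survivors."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: max(2, len(population) // 2)]
        children = [
            mutate(crossover(random.choice(survivors),
                             random.choice(survivors)), synonyms)
            for _ in range(len(population) - len(survivors))
        ]
        population = survivors + children
    return population
```

With shared memory feeding the survivor pool across agents, this kind of loop is what the paper credits for success rates more than doubling over 15 generations.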
Key Findings
- The swarm achieved an Effective Harm Rate of 45.8% against one frontier model, producing 49 critical-severity breaches; against the other frontier model the Effective Harm Rate was 0% despite a 40% technical success rate as measured by LLM-judge scoring.
- Evolutionary optimisation and shared memory improved jailbreak success: success rates more than doubled against one target across 15 generations, illustrating rapid capability gain from the scaffold rather than model size.
- In the assisted pipeline, the swarm recovered 9 of 9 planted vulnerabilities (100% recall) across diverse CWE classes in approximately four minutes on a consumer MacBook. With scaffold components disabled, the same model recovered 0 of 9 by crash verification and 2 of 9 by citation.
- The scaffold components (seed corpus, regex detection, role prompts, ASan crash classification, and CWE normalisation) account for the performance gap; the 1.2B model alone rarely generated triggering inputs or reliably labelled complex bugs.
- Automated LLM-as-judge scoring produced substantial false positives at the critical-severity tail; the paper proposes the Effective Harm Rate, requiring manual verification, as a more operational metric.
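The ASan crash classification credited above can be sketched as a lookup from AddressSanitizer error kinds to CWE labels; the mapping below is illustrative, not the paper's normalisation table:

```python
# Illustrative mapping from ASan error strings to CWE labels.
ASAN_TO_CWE = {
    "heap-buffer-overflow": "CWE-122",
    "stack-buffer-overflow": "CWE-121",
    "heap-use-after-free": "CWE-416",
    "double-free": "CWE-415",
}

def classify_asan_report(report: str) -> str:
    """Return the CWE for the first known ASan error kind in a crash report."""
    for kind, cwe in ASAN_TO_CWE.items():
        if kind in report:
            return cwe
    return "CWE-unknown"
```

This is exactly the kind of scaffold component the ablation removes: trivial code, but it supplies labels the 1.2B model could not produce reliably on its own.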
Limitations
Results use a synthetic, planted target rather than production code and reflect a single model version, hardware profile and random seed. The assisted pipeline includes hand-crafted components and design iterations that materially contributed to recall. A prior measurement bug showed how evaluation pipelines can leak ground truth. Effective Harm Rate relies on manual verification and does not scale; run-to-run variance is expected at this model scale.
Implications
Offensive security implications are that modest resources can reproduce high-impact capabilities if agents are coordinated by a publishable scaffold. An attacker with open-weights models, shared memory, parallel search, evolutionary optimisation and simple analysis tooling can automate jailbreak discovery and combine static analysis with dynamic fuzzing to find and exploit vulnerabilities quickly on commodity hardware. Publishing harnesses distributes capability broadly, reducing any defensive window gained by withholding a single frontier model; policy and procurement should therefore prioritise system-level assessment and adversarial testing under operationally meaningful metrics such as Effective Harm Rate.