Small LLM swarms jailbreak GPT-4o, find 9 CWEs
Policy keeps circling the same drain: the assumption that restricting frontier models reduces offensive capability. This paper tests that claim and lands a punch. It shows that the system scaffolding coordinating small models, not model size, is the enabler, and it backs that with an open-source framework called swarm-attack.
Swarm-attack runs multiple lightweight 1.2-billion-parameter Large Language Models (LLMs) as agents that share memory, search in parallel, and evolve strategies. In one sentence: the scaffold is glue code and simple tooling that lets small models act bigger through coordination.
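The coordination pattern described here, a shared memory plus agents that prefer strategies their peers already scored well, can be sketched roughly as below. All names are illustrative assumptions, not the swarm-attack API:

```python
import random

class SharedMemory:
    """Hypothetical blackboard: agents record strategies and their scores."""
    def __init__(self):
        self.successes = []

    def record(self, strategy, score):
        self.successes.append((strategy, score))

    def best(self, k=3):
        # Highest-scoring strategies seen by any agent so far.
        return sorted(self.successes, key=lambda s: -s[1])[:k]

def agent_step(agent_id, memory, candidate_strategies, evaluate):
    """One agent tries a strategy, reusing ones peers found promising."""
    pool = [s for s, _ in memory.best()] or candidate_strategies
    strategy = random.choice(pool)
    score = evaluate(strategy)
    memory.record(strategy, score)  # share the result with the swarm
    return strategy, score
```

The point of the sketch is that the "intelligence" lives in the loop, not the model: any agent can pick up where another left off because results flow through the shared memory.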
Jailbreak campaign
Five agent instances each launched 225 automated jailbreak attempts against two frontier services. Outcomes were scored by cross-provider LLM judges, then validated using an Effective Harm Rate metric that requires manual verification. Against GPT-4o, the swarm hit an Effective Harm Rate of 45.8% and produced 49 critical-severity breaches. Against Claude Sonnet-4, the Effective Harm Rate was 0% despite a 40% technical success rate by judge scoring. Evolutionary optimisation and shared memory mattered: success more than doubled against one target across 15 generations as prompts and strategies were mutated and reused across agents.
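The gap between judge-scored success and Effective Harm Rate can be made concrete. The sketch below assumes a simple definition, manually confirmed harms over total attempts; the paper's exact formula is not reproduced here:

```python
def technical_success_rate(judge_flags):
    """Fraction of attempts an LLM judge marked as successful."""
    return sum(judge_flags) / len(judge_flags)

def effective_harm_rate(judge_flags, verified_flags):
    """Fraction of attempts that survive manual verification.

    A judge-flagged attempt only counts if a human confirmed real harm,
    so EHR <= technical success rate by construction.
    """
    confirmed = [j and v for j, v in zip(judge_flags, verified_flags)]
    return sum(confirmed) / len(judge_flags)
```

Under this definition a 40% technical success rate with zero confirmed harms, as seen against Claude Sonnet-4, yields an EHR of exactly 0%.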
Code exploitation pipeline
The same swarm sat inside a classic static-plus-dynamic pipeline: regex pattern detection, role-specialised source review, a hand-crafted exploit seed corpus, post-hoc CWE normalisation, and AddressSanitizer-based crash classification. The target was a deliberately vulnerable C application with nine planted weaknesses across nine CWE classes. With the scaffold, the system recovered 9 of 9 vulnerabilities in roughly four minutes on a consumer MacBook. With those scaffold pieces disabled, performance collapsed to 0 of 9 by crash verification and 2 of 9 by citation. The lift came from the system around the model, not the tiny model’s raw reasoning.
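The regex pattern detection stage can be illustrated with a toy scanner. The signature-to-CWE mapping below is invented for illustration and is not the paper's rule set:

```python
import re

# Illustrative signatures only; the paper's actual detection rules are not
# reproduced here.
PATTERNS = {
    r"\bgets\s*\(": "CWE-242 (inherently dangerous function)",
    r"\bstrcpy\s*\(": "CWE-120 (unbounded buffer copy)",
    r"\bsystem\s*\(": "CWE-78 (possible OS command injection)",
}

def scan_source(code: str):
    """Return (line_number, cwe_label) hits for each matching pattern."""
    hits = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for pattern, label in PATTERNS.items():
            if re.search(pattern, line):
                hits.append((lineno, label))
    return hits
```

Cheap pattern hits like these give the agents starting points, which the dynamic half of the pipeline then tries to confirm with crashing inputs.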
One useful jab at current practice: LLM-as-judge scoring sprayed false positives across the critical-severity tail. The authors propose Effective Harm Rate as a more operational metric, but it needs manual verification and does not scale.
There are caveats. Results use a planted target, a single model version and hardware profile, and the assisted pipeline benefited from careful hand-crafted components. A prior measurement bug also shows how evals can leak ground truth. Still, the core point stands: the capability class that motivated restricted release of Anthropic’s Mythos Preview is reproducible at effectively zero cost when you publish the right scaffold. The open question is how this transfers to production code and live services. From a pentester’s seat, the real attack surface is the system design choices you make around memory, coordination, and evaluation loops.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Position: AI Security Policy Should Target Systems, Not Models
🔍 ShortSpan Analysis of the Paper
Problem
This paper examines whether AI security policy that restricts access to individual frontier models meaningfully reduces offensive capability. It argues the main threat is system-level scaffolding that coordinates multiple smaller models, and demonstrates that coordinated multi-agent frameworks can reproduce the capability class behind restricted-release decisions at effectively zero cost on commodity hardware.
Approach
The authors develop swarm-attack, an open-source adversarial testing framework in which multiple lightweight 1.2-billion-parameter language models run as distinct agents that share memory, explore in parallel, and evolve strategies through crossover and mutation. Two experiments are reported. Experiment 1 runs five agent instances that perform 225 automated jailbreak attacks each against two frontier services. Attack outcomes are scored by cross-provider LLM judges but validated with a manually verified Effective Harm Rate metric. Experiment 2 embeds the same swarm in a pipeline combining regex pattern detection, role-specialised source analysis, a hand-crafted exploit seed corpus, post-hoc CWE normalisation, and AddressSanitizer-instrumented binary fuzzing against a deliberately vulnerable C application containing nine planted vulnerabilities across nine CWE classes. An autonomous configuration disables the hand-crafted scaffold to isolate the model contribution.
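The crossover-and-mutation loop used in Experiment 1 can be sketched roughly as follows; the operators and the fitness interface are assumptions for illustration, not the framework's actual implementation:

```python
import random

def crossover(parent_a: str, parent_b: str) -> str:
    """Splice the first half of one prompt onto the second half of another."""
    cut_a, cut_b = len(parent_a) // 2, len(parent_b) // 2
    return parent_a[:cut_a] + parent_b[cut_b:]

def mutate(prompt: str, synonyms: dict) -> str:
    """Swap one known token for a synonym, if any token matches."""
    words = prompt.split()
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    if candidates:
        i = random.choice(candidates)
        words[i] = synonyms[words[i]]
    return " ".join(words)

def evolve(population, fitness, synonyms, generations=15):
    """Keep the fitter half, refill with mutated crossovers of survivors."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: max(2, len(population) // 2)]
        children = [
            mutate(crossover(random.choice(survivors),
                             random.choice(survivors)), synonyms)
            for _ in range(len(population) - len(survivors))
        ]
        population = survivors + children
    return population
```

With shared memory feeding the survivor pool across agents, this kind of loop is what the paper credits for success rates more than doubling over 15 generations.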
Key Findings
- The swarm achieved an Effective Harm Rate of 45.8% against one frontier model, producing 49 critical-severity breaches; against the other frontier model the Effective Harm Rate was 0% despite a 40% technical success rate as measured by LLM-judge scoring.
- Evolutionary optimisation and shared memory improved jailbreak success: success rates more than doubled against one target across 15 generations, illustrating rapid capability gain from the scaffold rather than model size.
- In the assisted pipeline, the swarm recovered 9 of 9 planted vulnerabilities (100% recall) across diverse CWE classes in approximately four minutes on a consumer MacBook. With scaffold components disabled, the same model recovered 0 of 9 by crash verification and 2 of 9 by citation.
- The scaffold components (seed corpus, regex detection, role prompts, ASan crash classification, and CWE normalisation) account for the performance gap; the 1.2B model alone rarely generated triggering inputs or reliably labelled complex bugs.
- Automated LLM-as-judge scoring produced substantial false positives at the critical-severity tail; the paper proposes the Effective Harm Rate, requiring manual verification, as a more operational metric.
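The ASan crash classification credited above can be sketched as a lookup from AddressSanitizer error kinds to CWE labels; the mapping below is illustrative, not the paper's normalisation table:

```python
# Illustrative mapping from ASan error strings to CWE labels.
ASAN_TO_CWE = {
    "heap-buffer-overflow": "CWE-122",
    "stack-buffer-overflow": "CWE-121",
    "heap-use-after-free": "CWE-416",
    "double-free": "CWE-415",
}

def classify_asan_report(report: str) -> str:
    """Return the CWE for the first known ASan error kind in a crash report."""
    for kind, cwe in ASAN_TO_CWE.items():
        if kind in report:
            return cwe
    return "CWE-unknown"
```

This is exactly the kind of scaffold component the ablation removes: trivial code, but it supplies labels the 1.2B model could not produce reliably on its own.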
Limitations
Results use a synthetic, planted target rather than production code and reflect a single model version, hardware profile and random seed. The assisted pipeline includes hand-crafted components and design iterations that materially contributed to recall. A prior measurement bug showed how evaluation pipelines can leak ground truth. Effective Harm Rate relies on manual verification and does not scale; run-to-run variance is expected at this model scale.
Implications
Offensive security implications are that modest resources can reproduce high-impact capabilities if agents are coordinated by a publishable scaffold. An attacker with open-weights models, shared memory, parallel search, evolutionary optimisation and simple analysis tooling can automate jailbreak discovery and combine static analysis with dynamic fuzzing to find and exploit vulnerabilities quickly on commodity hardware. Publishing harnesses distributes capability broadly, reducing any defensive window gained by withholding a single frontier model; policy and procurement should therefore prioritise system-level assessment and adversarial testing under operationally meaningful metrics such as Effective Harm Rate.