Train Agents to Find Vulnerabilities at Scale
Pentesting
New work packages 658 real capture-the-flag challenges into CTF-Dojo and spins up their environments in minutes with CTF-Forge, then uses verified agent runs to teach models how to find bugs. The headline result: a 32B model trained on under 500 execution-verified trajectories gains up to 11.6% absolute over strong baselines. That means more capable open models at lower cost and faster iteration.
Why this matters: if you care about automated testing, this lowers the barrier to building tools that find real vulnerabilities. If you care about risk, the flip side is just as significant - the same methods make it easier to build agents that chain exploits without human oversight.
Quick checklist for teams that must ship or defend:
- Isolate - run training and evaluation in locked-down containers with no external network access (a minimal sketch follows this list)
- Verify - require execution-verified traces before using interactive data for training
- Control - maintain strict access logs and role-based keys for any agent capable of exploit-finding
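For the isolate step, here is a minimal sketch using the docker Python SDK. The image name, entrypoint and resource limits are illustrative placeholders, not anything from the paper; tighten them to your own policy.

```python
# Minimal isolation sketch using the docker Python SDK (pip install docker).
# Image name and command are hypothetical placeholders.
import docker

client = docker.from_env()

container = client.containers.run(
    "ctf-challenge:latest",        # hypothetical challenge image
    command="python solve.py",     # hypothetical agent entrypoint
    network_mode="none",           # no external network access
    read_only=True,                # immutable root filesystem
    cap_drop=["ALL"],              # drop all Linux capabilities
    mem_limit="512m",              # bound memory use
    pids_limit=128,                # bound process count
    detach=True,
)
print(container.wait())            # block until the run finishes
print(container.logs().decode())   # capture output for the replayable log
container.remove()
```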
Minimal viable controls: sandboxed execution, per-run attestations, and replayable logs (an attestation sketch follows the options below). Good-better-best options:
- Good - ephemeral networks and RBAC
- Better - deterministic replay plus gated review of agent actions
- Best - third-party attestation, cryptographic logs, and staged deployment to non-production targets only
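For per-run attestations and replayable logs, one lightweight approach is to hash each trajectory log into an append-only ledger so any replay can be checked against the recorded digest. A minimal sketch, with illustrative field names:

```python
# Per-run attestation sketch: record a SHA-256 digest of the trajectory
# log so replays can be verified against it. Field names are illustrative.
import hashlib
import json
import time

def attest_run(log_path: str, run_id: str) -> dict:
    digest = hashlib.sha256()
    with open(log_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    record = {
        "run_id": run_id,
        "log_sha256": digest.hexdigest(),
        "attested_at": time.time(),
    }
    # Append-only ledger; swap in a signed or third-party store in production.
    with open("attestations.jsonl", "a") as ledger:
        ledger.write(json.dumps(record) + "\n")
    return record
```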
In short: the paper gives defenders a powerful tool and attackers a clearer road map. Fixes are straightforward - isolate, verify, and limit - but you need to do them now, not after a headline exploit forces your hand.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Training Language Model Agents to Find Vulnerabilities with CTF-Dojo
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies how to train large language model agents using execution-grounded environments to find and exploit software vulnerabilities. It addresses the scarcity of scalable, reproducible runtimes that provide verifiable feedback for iterative, multi-step security tasks and the resulting gap that limits open-weight model performance in offensive cybersecurity evaluations.
Approach
The authors introduce CTF-Dojo, a collection of 658 containerised Capture-The-Flag challenges, and CTF-Forge, an automated pipeline that converts public CTF artifacts into Dockerised runtimes in minutes (average 0.5 seconds per container) with a 98% automated validation pass rate. They used an ENIGMA+ agent scaffold to collect interaction traces, gathering 1,006 raw successful trajectories and filtering to 486 verified trajectories from 274 solved challenges. Trajectories were generated with teacher models including DeepSeek-V3-0324 and Qwen3 variants, then used for rejection-sampling fine-tuning of Qwen3-based models up to 32B parameters. Evaluation used three benchmarks: InterCode-CTF, NYU CTF Bench and Cybench, measured with Pass@1.
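The execution-verification step can be thought of as keeping only trajectories whose success the runtime itself confirms, for example by matching the submitted flag against ground truth. A minimal sketch of that idea, with hypothetical trajectory fields; the paper's actual filtering pipeline is not specified at this level of detail:

```python
# Sketch of execution-verified filtering for rejection-sampling fine-tuning:
# keep a trajectory only if the flag it submitted matches the ground truth
# recorded by the runtime. Field names are hypothetical.
def filter_verified(trajectories: list[dict]) -> list[dict]:
    verified = []
    for traj in trajectories:
        submitted = traj.get("submitted_flag")
        expected = traj.get("ground_truth_flag")
        if submitted is not None and submitted == expected:
            verified.append(traj)
    return verified

# In the paper's terms: 1,006 raw successes filtered down to 486 verified.
```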
Key Findings
- Data efficiency: Fine‑tuning on 486 execution‑verified trajectories yielded up to 11.6% absolute gains and produced a 32B model with 31.9% Pass@1, approaching frontier systems.
- Environment tooling matters: CTF-Forge automated rehosting and runtime augmentation (randomised ports, paths and dynamic flags) improved trajectory yield and generalisation; augmentations substantially increased solved counts (see the sketch after this list).
- External writeups and teacher diversity boost coverage: incorporating redacted public writeups increased solved rates (on average +7.4%) and combining multiple teacher models raised downstream performance.
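A minimal sketch of what randomised runtime augmentation can look like in practice; the value ranges and flag format are illustrative assumptions, not the paper's:

```python
# Runtime augmentation sketch: randomise the port, flag location and flag
# value per run. Ranges and formats here are illustrative assumptions.
import random
import secrets

def augment_runtime() -> dict:
    return {
        "port": random.randint(20000, 60000),          # randomised service port
        "flag_path": f"/srv/{secrets.token_hex(4)}/flag.txt",  # randomised path
        "flag": "flag{" + secrets.token_hex(16) + "}", # dynamic flag value
    }
```

Randomising ports, paths and flags discourages memorised solutions: the agent has to rediscover each artefact at run time, which is consistent with the generalisation gains reported above.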
Limitations
The dataset is finite and static (658 challenges), and current results rely on available writeups and teacher models; evaluation on live, continually changing competitions is left for future work. Training compute, runtime cost and full deployment infrastructure details are not reported. Potential overfitting to CTF artefacts and the remaining unsolved tasks indicate coverage limits.
Why It Matters
CTF-Dojo demonstrates that execution-grounded training can rapidly increase automated vulnerability discovery capability, lowering the barrier to effective open-weight security agents. This has defensive benefits for scalable testing and patching but clear dual-use risks: more capable agents could autonomously identify and chain exploits if misused. The authors highlight the need for strong sandboxing, reproducible and verifiable feedback loops, strict access controls and careful deployment pipelines to reduce misuse while enabling research and responsible tooling adoption.