Train Agents to Find Vulnerabilities at Scale
Pentesting
New work packages 658 real capture-the-flag challenges into CTF-Dojo and spins up their environments in minutes with CTF-Forge, then uses verified agent runs to teach models how to find bugs. The headline result: a 32B model trained on under 500 execution-verified trajectories gains up to 11.6% absolute over strong baselines. That means more capable open models at lower cost and faster iteration.
Why this matters: if you care about automated testing, this lowers the barrier to building tools that find real vulnerabilities. If you care about risk, the flip side is just as significant - the same methods make it easier to build agents that chain exploits without human oversight.
Quick checklist for teams that must ship or defend:
- Isolate - run training and evaluation in locked-down containers with no external network access (a minimal sketch follows this list)
- Verify - require execution-verified traces before using interactive data for training
- Control - maintain strict access logs and role-based keys for any agent capable of exploit-finding
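For the isolate step, here is a minimal sketch using the docker Python SDK. The image name, entrypoint and resource limits are illustrative placeholders, not anything from the paper; tighten them to your own policy.

```python
# Minimal isolation sketch using the docker Python SDK (pip install docker).
# Image name and command are hypothetical placeholders.
import docker

client = docker.from_env()

container = client.containers.run(
    "ctf-challenge:latest",        # hypothetical challenge image
    command="python solve.py",     # hypothetical agent entrypoint
    network_mode="none",           # no external network access
    read_only=True,                # immutable root filesystem
    cap_drop=["ALL"],              # drop all Linux capabilities
    mem_limit="512m",              # bound memory use
    pids_limit=128,                # bound process count
    detach=True,
)
print(container.wait())            # block until the run finishes
print(container.logs().decode())   # capture output for the replayable log
container.remove()
```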
Minimal viable controls: sandboxed execution, per-run attestations, and replayable logs (an attestation sketch follows the options below). Good-better-best options:
- Good - ephemeral networks and RBAC
- Better - deterministic replay plus gated review of agent actions
- Best - third-party attestation, cryptographic logs, and staged deployment to non-production targets only
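For per-run attestations and replayable logs, one lightweight approach is to hash each trajectory log into an append-only ledger so any replay can be checked against the recorded digest. A minimal sketch, with illustrative field names:

```python
# Per-run attestation sketch: record a SHA-256 digest of the trajectory
# log so replays can be verified against it. Field names are illustrative.
import hashlib
import json
import time

def attest_run(log_path: str, run_id: str) -> dict:
    digest = hashlib.sha256()
    with open(log_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    record = {
        "run_id": run_id,
        "log_sha256": digest.hexdigest(),
        "attested_at": time.time(),
    }
    # Append-only ledger; swap in a signed or third-party store in production.
    with open("attestations.jsonl", "a") as ledger:
        ledger.write(json.dumps(record) + "\n")
    return record
```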
In short: the paper gives defenders a powerful tool and attackers a clearer road map. Fixes are straightforward - isolate, verify, and limit - but you need to do them now, not after a headline exploit forces your hand.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Training Language Model Agents to Find Vulnerabilities with CTF-Dojo
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies how to train large language model agents using execution-grounded environments to find and exploit software vulnerabilities. It addresses the scarcity of scalable, reproducible runtimes that provide verifiable feedback for iterative, multi-step security tasks and the resulting gap that limits open-weight model performance in offensive cybersecurity evaluations.
Approach
The authors introduce CTF-Dojo, a collection of 658 containerised Capture-The-Flag challenges, and CTF-Forge, an automated pipeline that converts public CTF artifacts into Dockerised runtimes in minutes (average 0.5 seconds per container) with a 98% automated validation pass rate. They used an ENIGMA+ agent scaffold to collect interaction traces, gathering 1,006 raw successful trajectories and filtering to 486 verified trajectories from 274 solved challenges. Trajectories were generated with teacher models including DeepSeek-V3-0324 and Qwen3 variants, then used for rejection-sampling fine-tuning of Qwen3-based models up to 32B parameters. Evaluation used three benchmarks: InterCode-CTF, NYU CTF Bench and Cybench, measured with Pass@1.
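The execution-verification step can be thought of as keeping only trajectories whose success the runtime itself confirms, for example by matching the submitted flag against ground truth. A minimal sketch of that idea, with hypothetical trajectory fields; the paper's actual filtering pipeline is not specified at this level of detail:

```python
# Sketch of execution-verified filtering for rejection-sampling fine-tuning:
# keep a trajectory only if the flag it submitted matches the ground truth
# recorded by the runtime. Field names are hypothetical.
def filter_verified(trajectories: list[dict]) -> list[dict]:
    verified = []
    for traj in trajectories:
        submitted = traj.get("submitted_flag")
        expected = traj.get("ground_truth_flag")
        if submitted is not None and submitted == expected:
            verified.append(traj)
    return verified

# In the paper's terms: 1,006 raw successes filtered down to 486 verified.
```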
Key Findings
- Data efficiency: Fine‑tuning on 486 execution‑verified trajectories yielded up to 11.6% absolute gains and produced a 32B model with 31.9% Pass@1, approaching frontier systems.
- Environment tooling matters: CTF-Forge automated rehosting and runtime augmentation (randomised ports, paths and dynamic flags) improved trajectory yield and generalisation; augmentations substantially increased solved counts (see the sketch after this list).
- External writeups and teacher diversity boost coverage: incorporating redacted public writeups increased solved rates (on average +7.4%) and combining multiple teacher models raised downstream performance.
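A minimal sketch of what randomised runtime augmentation can look like in practice; the value ranges and flag format are illustrative assumptions, not the paper's:

```python
# Runtime augmentation sketch: randomise the port, flag location and flag
# value per run. Ranges and formats here are illustrative assumptions.
import random
import secrets

def augment_runtime() -> dict:
    return {
        "port": random.randint(20000, 60000),          # randomised service port
        "flag_path": f"/srv/{secrets.token_hex(4)}/flag.txt",  # randomised path
        "flag": "flag{" + secrets.token_hex(16) + "}", # dynamic flag value
    }
```

Randomising ports, paths and flags discourages memorised solutions: the agent has to rediscover each artefact at run time, which is consistent with the generalisation gains reported above.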
Limitations
The dataset is finite and static (658 challenges), and current results rely on available writeups and teacher models; evaluation on live, continually changing competitions is left for future work. Training compute, runtime cost and full deployment infrastructure details are not reported. Potential overfitting to CTF artefacts and the remaining unsolved tasks indicate coverage limits.
Why It Matters
CTF-Dojo demonstrates that execution-grounded training can rapidly increase automated vulnerability discovery capability, lowering the barrier to effective open-weight security agents. This has defensive benefits for scalable testing and patching but clear dual-use risks: more capable agents could autonomously identify and chain exploits if misused. The authors highlight the need for strong sandboxing, reproducible and verifiable feedback loops, strict access controls and careful deployment pipelines to reduce misuse while enabling research and responsible tooling adoption.