Auto-synthesised agent harnesses find Chrome zero-days
Agents
Multi-agent Large Language Model (LLM) setups are increasingly used to hunt bugs. The catch: results often hinge less on the model and more on the glue code that tells agents who does what, with which tools, and how they talk. This paper targets that glue. It shows the orchestrator, not the model, is the main lever.
AgentFlow encodes a harness as a typed graph: nodes are agents, edges move messages or trigger retries. The design space includes agent roles and prompts, the topology of who talks to whom, per-edge message schemas, tool allocation, and the coordination protocol. An outer loop proposes small harness edits, runs them if they pass a cheap static type check, and consumes structured runtime feedback from the target: test verdicts, stdout/stderr, line-level coverage, and sanitiser outputs. A diagnoser reads both program signals and agent traces to decide why a run failed. Did the inputs never reach the interesting code? Or did execution reach it but fail to trigger a condition? The next edit targets that root cause instead of random thrashing.
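The paper's DSL is not reproduced here, but the five components it names map naturally onto a small typed graph. The following is a rough sketch of what that might look like, with every class, field, and schema convention being an assumption of ours rather than the paper's actual interface:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Agent:
    name: str
    role: str                        # role/prompt description
    tools: frozenset = frozenset()   # tool allocation for this agent
    emits: str = "text"              # output message schema (simplified to a tag)
    accepts: str = "text"            # input message schema

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str = "data"               # "data" moves messages, "retry" triggers re-runs

@dataclass
class Harness:
    agents: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add(self, agent: Agent):
        self.agents[agent.name] = agent

    def connect(self, src: str, dst: str, kind: str = "data"):
        self.edges.append(Edge(src, dst, kind))

    def type_check(self) -> list:
        """Cheap static check: every edge references a known agent, and
        message schemas line up across each data edge."""
        errors = []
        for e in self.edges:
            if e.src not in self.agents or e.dst not in self.agents:
                errors.append(f"unknown agent on edge {e.src}->{e.dst}")
                continue
            if e.kind == "data" and self.agents[e.src].emits != self.agents[e.dst].accepts:
                errors.append(f"schema mismatch on edge {e.src}->{e.dst}")
        return errors

# Example topology: an input generator feeding a crash triager,
# with a retry link back for another attempt.
h = Harness()
h.add(Agent("gen", "generate candidate inputs", frozenset({"compile"}), emits="input"))
h.add(Agent("triage", "run and classify crashes", frozenset({"run", "asan"}), accepts="input"))
h.connect("gen", "triage")
h.connect("triage", "gen", kind="retry")
print(h.type_check())  # [] — well-formed
```

A check this cheap is the point: malformed proposals are rejected before any model is invoked, so the outer loop only pays for runs that are at least structurally sensible.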
On the TerminalBench-2 benchmark with Claude Opus 4.6, a single synthesised harness reached 84.3%, the top score on the public leaderboard snapshot they evaluated. More interestingly, the same loop, driven by Kimi K2.5, found ten previously unknown zero-days in Google Chrome, including two Critical sandbox-escape issues (CVE-2026-5280 and CVE-2026-6297). The practical point: with source access and instrumentation, runtime signals turn agent orchestration into an iterative search that can reliably reach and trigger deep bugs, not just toy crashes.
Why it works is mundane and therefore useful. Pass/fail is a blunt tool; coverage and sanitiser reports are specific. If coverage shows you never hit the target code, you restructure the agent graph or change tool access to improve reachability. If you hit it without a fault, you adjust inputs or the retry links and message schemas to push state over the edge. The harness becomes the optimisation surface, and a type system keeps the search from dissolving into nonsense.
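The triage described above reduces to a three-way classification. A minimal sketch, with the signal shapes (line sets, an optional sanitiser report) assumed for illustration:

```python
def diagnose(target_lines: set, covered_lines: set, sanitiser_report) -> str:
    """Classify a run from coverage and sanitiser signals.

    Returns one of:
      "fault"            - target code reached and a sanitiser fired
      "reached_no_fault" - reached the code but no error condition triggered
      "not_reached"      - inputs never hit the target code
    """
    reached = bool(target_lines & covered_lines)
    if reached and sanitiser_report:
        return "fault"
    if reached:
        return "reached_no_fault"   # edit inputs, retry links, message schemas
    return "not_reached"            # restructure topology or tool access

# Coverage shows lines 10-20 ran, but the target is lines 40-45:
print(diagnose(set(range(40, 46)), set(range(10, 21)), None))  # not_reached
```

Each label maps to a different class of harness edit, which is what turns the search from a random walk into something directed.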
There are caveats. This needs source and instrumented builds; without coverage and sanitiser signals the diagnoser loses teeth. The DSL uses static topologies, so no dynamic agent spawning mid-run. The loop still depends on LLM quality and a non-trivial compute budget, and the Chrome campaign is a case study rather than a compute-matched bake-off. Even so, the commercial read is clear: if you control the build and can collect runtime signals, orchestration engineering can matter more than swapping models. Expect red teams and motivated attackers with source access to copy this playbook. If you were hoping for magic, this is not it. It is plumbing, and the pipes apparently leak. Watch this space.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies how the program that organises multiple language-model agents, called a harness, determines success in automated vulnerability discovery. Hand-crafted harnesses are brittle and small changes can multiply task success rates, yet prior automatic harness optimisers search only narrow parts of the design space and use coarse pass/fail feedback that gives little diagnostic value. Improving harness synthesis is important because multi-agent systems are increasingly able to find real zero-day bugs that human auditors and classical fuzzers missed.
Approach
AgentFlow represents a harness as a typed-graph domain-specific language (DSL) in which nodes are agents and edges encode dataflow or retry links. The DSL exposes five first-class components: the agent set, communication topology, per-edge message schemas, tool allocation, and coordination protocol. An outer-loop optimiser proposes local edits to the DSL, validates them cheaply with a static type check, executes well-formed harnesses, and collects structured runtime feedback from the target program (test verdicts, stdout/stderr, line-level coverage, and sanitiser reports). A diagnoser reads the full feedback and agent traces to localise why a harness failed and suggests corrective edits; the proposer uses that diagnosis and a short archive of past trials to emit the next harness proposal. The same synthesis loop was evaluated on the TerminalBench-2 benchmark using Claude Opus 4.6 and on the Google Chrome codebase using Kimi K2.5.
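The outer loop described above can be sketched as a propose/check/run/diagnose cycle. In this minimal illustration every callable stands in for an LLM-backed or instrumented component, and all signatures are our assumptions, not the paper's API:

```python
def synthesise(seed_harness, propose, type_check, run, diagnose, budget=50):
    """Outer-loop sketch: propose a local edit, discard it cheaply if it
    fails the static type check, otherwise run it and feed the diagnosis
    back to the proposer via a short archive of past trials."""
    archive = []                        # (candidate, diagnosis) history
    best, best_score = seed_harness, float("-inf")
    current = seed_harness
    for _ in range(budget):
        candidate = propose(current, archive)
        if type_check(candidate):       # non-empty error list: reject pre-run
            continue
        feedback = run(candidate)       # verdicts, coverage, sanitiser output
        archive.append((candidate, diagnose(feedback)))
        if feedback["score"] > best_score:
            best, best_score = candidate, feedback["score"]
            current = candidate         # hill-climb from the best so far
    return best

# Toy demo: "harnesses" are integers and the score is the value itself.
result = synthesise(
    seed_harness=0,
    propose=lambda h, a: h + 1,
    type_check=lambda h: [],            # always well-formed
    run=lambda h: {"score": h},
    diagnose=lambda fb: "reached_no_fault" if fb["score"] else "not_reached",
    budget=5,
)
print(result)  # 5
```

In the real system the diagnoser's root-cause label is what steers the proposer; the archive here plays that role in miniature.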
Key Findings
- Full-space search with typing yields large gains: AgentFlow synthesised a single harness that achieves an 84.3% pass rate on TerminalBench-2 with Claude Opus 4.6, the highest score in the public leaderboard snapshot evaluated.
- Runtime diagnostics materially improve search: by consuming coverage and sanitiser outputs, the diagnoser can distinguish failures that never reached vulnerable code from those that reached it but did not trigger an error, enabling targeted edits rather than random-walk changes.
- Real-world impact on a large codebase: the same loop, driven by Kimi K2.5, found ten previously unknown zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape bugs reported as CVE-2026-5280 and CVE-2026-6297, all confirmed by the vendor.
Limitations
AgentFlow requires source access and the ability to instrument builds so that coverage and sanitiser channels are available; without those channels diagnostic quality falls. The DSL restricts harnesses to static topologies, disallowing dynamic agent spawning during a single execution. The optimisation process depends on LLM proposers and diagnosers, and malformed proposals still occur, though roughly 20% are filtered cheaply by the type checker in experiments. The Chrome campaign used a mid-tier open-weight model and a substantial compute budget, so results are presented as a case study rather than a compute-matched comparative benchmark.
Implications
From an offensive-security viewpoint, AgentFlow shows that harness design alone can substantially increase automated exploit discovery. An attacker or red team with access to target source and instrumentation could automate multi-agent orchestration and use structured program signals to rapidly iterate towards inputs that reach and trigger bugs, including sandbox-escape flaws. The work highlights that harness synthesis is itself a critical attack surface: controlling the orchestrator can multiply the effectiveness of otherwise fixed LLM agents.