
Auto-synthesised agent harnesses find Chrome zero-days

Published: Thu, Apr 23, 2026 • By Adrian Calder
AgentFlow automates how multiple Large Language Model (LLM) agents are wired together for vulnerability discovery. By searching harness design with runtime feedback, it boosts benchmark scores and, in a Chrome case study, found ten new zero-days including two Critical sandbox escapes. Big gains, but only with source access and instrumented builds.

Multi-agent Large Language Model (LLM) setups are increasingly used to hunt bugs. The catch: results often hinge less on the model and more on the glue code that tells agents who does what, with which tools, and how they talk. This paper targets that glue. It shows the orchestrator, not the model, is the main lever.

AgentFlow encodes a harness as a typed graph: nodes are agents, edges move messages or trigger retries. The design space includes agent roles and prompts, the topology of who talks to whom, per-edge message schemas, tool allocation, and the coordination protocol. An outer loop proposes small harness edits, runs them if they pass a cheap static type check, and consumes structured runtime feedback from the target: test verdicts, stdout/stderr, line-level coverage, and sanitiser outputs. A diagnoser reads both program signals and agent traces to decide why a run failed. Did the inputs never reach the interesting code? Or did execution reach it but fail to trigger a condition? The next edit targets that root cause instead of random thrashing.
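The paper's DSL itself is not reproduced here, but the idea of a harness as a typed graph with a cheap static check can be sketched. All names below are illustrative, not AgentFlow's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    role: str                    # e.g. "planner", "input-generator", "triager"
    tools: set[str] = field(default_factory=set)

@dataclass
class Edge:
    src: str
    dst: str
    schema: str                  # message type the edge carries, e.g. "CrashReport"
    kind: str = "dataflow"       # or "retry"

@dataclass
class Harness:
    agents: dict[str, Agent]
    edges: list[Edge]

    def type_check(self) -> list[str]:
        """Cheap static gate run before any expensive execution: every edge
        must connect known agents, and each agent must receive a consistent
        message schema across its incoming edges."""
        errors: list[str] = []
        incoming: dict[str, set[str]] = {}
        for e in self.edges:
            if e.src not in self.agents or e.dst not in self.agents:
                errors.append(f"edge {e.src}->{e.dst}: unknown agent")
                continue
            incoming.setdefault(e.dst, set()).add(e.schema)
        for dst, schemas in incoming.items():
            if len(schemas) > 1:
                errors.append(f"{dst} receives mixed schemas {sorted(schemas)}")
        return errors
```

The point of the static check is economy: a malformed proposal is rejected for the cost of a graph walk rather than a full agent run.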

On the TerminalBench-2 benchmark with Claude Opus 4.6, a single synthesised harness reached 84.3%, the top score on the public leaderboard snapshot they evaluated. More interestingly, the same loop, driven by Kimi K2.5, found ten previously unknown zero-days in Google Chrome, including two Critical sandbox-escape issues (CVE-2026-5280 and CVE-2026-6297). The practical point: with source access and instrumentation, runtime signals turn agent orchestration into an iterative search that can reliably reach and trigger deep bugs, not just toy crashes.

Why it works is mundane and therefore useful. Pass/fail is a blunt tool; coverage and sanitiser reports are specific. If coverage shows you never hit the target code, you restructure the agent graph or change tool access to improve reachability. If you hit it without a fault, you adjust inputs or the retry links and message schemas to push state over the edge. The harness becomes the optimisation surface, and a type system keeps the search from dissolving into nonsense.

There are caveats. This needs source and instrumented builds; without coverage and sanitiser signals the diagnoser loses teeth. The DSL uses static topologies, so no dynamic agent spawning mid-run. The loop still depends on LLM quality and a non-trivial compute budget, and the Chrome campaign is a case study rather than a compute-matched bake-off. Even so, the commercial read is clear: if you control the build and can collect runtime signals, orchestration engineering can matter more than swapping models. Expect red teams and motivated attackers with source access to copy this playbook. If you were hoping for magic, this is not it. It is plumbing, and the pipes apparently leak. Watch this space.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

Authors: Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, and Yu Feng
LLM agents have begun to find real security vulnerabilities that human auditors and automated fuzzers missed for decades, in source-available targets where the analyst can build and instrument the code. In practice the work is split among several agents, wired together by a harness: the program that fixes which roles exist, how they pass information, which tools each may call, and how retries are coordinated. When the language model is held fixed, changing only the harness can still change success rates by several-fold on public agent benchmarks, yet most harnesses are written by hand; recent harness optimizers each search only a narrow slice of the design space and rely on coarse pass/fail feedback that gives no diagnostic signal about why a trial failed. AgentFlow addresses both limitations with a typed graph DSL whose search space jointly covers agent roles, prompts, tools, communication topology, and coordination protocol, paired with a feedback-driven outer loop that reads runtime signals from the target program itself to diagnose which part of the harness caused the failure and rewrite it accordingly. We evaluate AgentFlow on TerminalBench-2 with Claude Opus 4.6 and on Google Chrome with Kimi K2.5. AgentFlow reaches 84.3% on TerminalBench-2, the highest score in the public leaderboard snapshot we evaluate against, and discovers ten previously unknown zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape vulnerabilities (CVE-2026-5280 and CVE-2026-6297).

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies how the program that organises multiple language-model agents, called a harness, determines success in automated vulnerability discovery. Hand-crafted harnesses are brittle, and small changes can shift task success rates several-fold, yet prior automatic harness optimisers search only narrow parts of the design space and use coarse pass/fail feedback with little diagnostic value. Improving harness synthesis matters because multi-agent systems are increasingly able to find real zero-day bugs that human auditors and classical fuzzers missed.

Approach

AgentFlow represents a harness as a typed graph domain-specific language in which nodes are agents and edges encode dataflow or retry links. The DSL exposes five first-class components: the agent set, communication topology, per-edge message schemas, tool allocation, and coordination protocol. An outer-loop optimiser proposes local edits to the DSL, validates them cheaply with a static type check, executes well-formed harnesses, and collects structured runtime feedback from the target program (test verdicts, stdout/stderr, line-level coverage, and sanitizer reports). A diagnoser reads full feedback and agent traces to localise why a harness failed and suggests corrective edits; the proposer uses that diagnosis and a short archive of past trials to emit the next harness proposal. The same synthesis loop was evaluated on the TerminalBench-2 benchmark using Claude Opus 4.6 and on the Google Chrome codebase using Kimi K2.5.
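The outer loop described above can be sketched as a simple propose-check-run-diagnose cycle. The function bodies passed in stand in for the LLM-driven proposer and diagnoser; none of these names come from the paper:

```python
def synthesize(initial, propose, type_check, run, diagnose, budget=50):
    """Feedback-driven outer loop: propose a local harness edit, gate it
    with a cheap static check, execute survivors, and feed the diagnosis
    and a short archive of past trials into the next proposal."""
    best = initial
    best_score, _ = run(initial)
    archive = []                          # short history of past trials
    for _ in range(budget):
        candidate = propose(best, archive)
        errors = type_check(candidate)
        if errors:                        # malformed proposals filtered cheaply
            archive.append(("type-error", errors))
            continue
        score, feedback = run(candidate)  # verdicts, coverage, sanitiser output
        archive.append((score, diagnose(feedback)))
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

The structure explains the paper's reported economics: the type checker filters roughly a fifth of proposals before they cost a run, and the diagnoser keeps the remaining budget aimed at root causes.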

Key Findings

  • Full-space search with typing yields large gains: AgentFlow synthesised a single harness that achieves an 84.3% pass rate on TerminalBench-2 with Claude Opus 4.6, the highest score in the public leaderboard snapshot evaluated.
  • Runtime diagnostics materially improve search: by consuming coverage and sanitizer outputs, the diagnoser can distinguish failures that never reached vulnerable code from those that reached it but did not trigger an error, enabling targeted edits rather than random-walk changes.
  • Real-world impact on a large codebase: the same loop, driven by Kimi K2.5, found ten previously unknown zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape bugs reported as CVE-2026-5280 and CVE-2026-6297, all confirmed by the vendor.

Limitations

AgentFlow requires source access and the ability to instrument builds so that coverage and sanitizer channels are available; without those channels diagnostic quality falls. The DSL restricts harnesses to static topologies, disallowing dynamic agent spawning during a single execution. The optimisation process depends on LLM proposers and diagnosers, and malformed proposals still occur, though roughly 20% are filtered cheaply by the type checker in experiments. The Chrome campaign used a mid-tier open-weight model and a substantial compute budget, so results are presented as a case study rather than a compute-matched comparative benchmark.

Implications

From an offensive-security viewpoint, AgentFlow shows that harness design alone can substantially increase automated exploit discovery. An attacker or red team with access to target source and instrumentation could automate multi-agent orchestration and use structured program signals to rapidly iterate towards inputs that reach and trigger bugs, including sandbox-escape flaws. The work highlights that harness synthesis is itself a critical attack surface: controlling the orchestrator can multiply the effectiveness of otherwise fixed LLM agents.

