
Agentic LLMs Reproduce Linux Kernel PoCs

Published: Wed, Feb 11, 2026 • By Elise Veyron
A study finds autonomous Large Language Model (LLM) agents can reproduce proofs of concept (PoCs) for real Linux kernel vulnerabilities in over 50% of cases. K-Repro automates code browsing, building and debugging inside virtual machines, often finishing within tens of minutes at a few dollars per case, though race and temporal memory bugs remain hard.

Reproducing vulnerabilities in the Linux kernel is a difficult, low-level task. It often requires understanding subtle triggering chains, building specialised kernels and running carefully instrumented tests. That makes it a useful stress test for autonomous security systems based on Large Language Models (LLMs). The paper under review presents a systematic, large-scale experiment that asks how far an agentic LLM system can go when given a security patch and tasked to generate a concrete proof of concept (PoC).

The authors build K-Repro, an agentic system that combines an LLM with a compact toolset: code browsing, virtual machine management, VM interaction and debugger controls. Given a patch commit, K-Repro builds a vulnerable kernel image inside QEMU, runs tests with sanitiser support and iterates until it produces a PoC or times out. The evaluation uses 100 real, exploitable Linux kernel vulnerabilities from KernelCTF and validates outcomes with deterministic runs and memory error detection.
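
The paper describes this loop at an architectural level rather than publishing code. As a rough illustration only, the control flow might be wired up like the sketch below, where the Action type, the tool registry and the KASAN success oracle are assumptions for exposition, not K-Repro's actual interfaces:

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    """One step chosen by the LLM: either a tool call or a candidate PoC."""
    kind: str                      # "tool" or "poc"
    tool: str = ""                 # e.g. "browse_source", "vm_run", "gdb_break"
    args: tuple = ()
    poc_source: str = ""           # C source of the candidate PoC

def reproduce(patch_commit: str,
              next_action: Callable[[list[str]], Action],
              tools: dict[str, Callable[..., str]],
              run_poc: Callable[[str], Optional[str]],
              budget_minutes: float = 60.0):
    """Iterate: let the model pick tools or emit PoCs until a sanitiser-
    detected crash confirms reproduction, or the time budget runs out."""
    deadline = time.time() + budget_minutes * 60
    history = [f"target patch: {patch_commit}"]
    while time.time() < deadline:
        action = next_action(history)               # the LLM decides the next step
        if action.kind == "poc":
            crash_log = run_poc(action.poc_source)  # compile and run inside the VM
            if crash_log and "KASAN" in crash_log:  # memory-error oracle
                return action.poc_source, crash_log
            history.append(f"poc failed: {crash_log or 'no crash observed'}")
        else:
            result = tools[action.tool](*action.args)
            history.append(f"{action.tool}: {result}")
    return None, None                               # timed out
```

The essential design choice is that the model, not a fixed script, decides which tool to call next, with a sanitiser-detected crash acting as the automatic stop condition.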

What the study shows

K-Repro achieves non-trivial results. The higher-capability model reproduced PoCs in 56.0% of the 100 cases; a medium model reproduced 47.0%. Average monetary cost per case is modest: 4.0 US dollars for the higher-capability model and 4.7 for the medium one. Average execution time is measured in minutes rather than hours: roughly 19.3 minutes for the higher-capability model and 33.8 minutes for the medium model, although individual cases vary widely.

Success is not usually immediate. The agent typically makes multiple PoC attempts and calls tools extensively: the higher-capability model averages around 129 tool calls per case, including source browsing, VM interactions and debugging actions. One-shot PoC generation is uncommon, and successful runs tend to require fewer iterations than failed ones. Certain classes of bugs remain stubborn: race conditions and temporal memory faults such as use-after-free and double free are substantially harder to reproduce than out-of-bounds faults. The study also finds that detailed commit messages and subsystem-specific clues improve reproducibility, while the model's knowledge cutoff date has no clear effect on success.

Comparison and limits

On an 85-case subset aligned to prior work, K-Repro's stronger model reproduced 57 vulnerabilities (67.1%), versus 42 reproductions by a non-LLM approach called SyzDirect in its original evaluation. The time contrast is stark: successful K-Repro runs average about 17 minutes, whereas SyzDirect's successful reproductions averaged roughly 11.7 hours. Cost per case for K-Repro on that subset is reported at about 2.19 US dollars.

Those numbers demonstrate feasibility but not universality. The authors note limitations: the testbed uses upstream kernels and specific configurations, automatic success is judged by PoC-triggered crashes, which do not always prove an identical root cause, and some failures stem from environment differences or brittle race triggers. The paper is careful about these boundaries.

The practical implication is double-edged. Autonomous agents can materially accelerate low-level vulnerability reproduction and provide new tooling for both offensive research and defensive triage. At the same time, defenders and vendors should treat these results as a prompt to reassess N-day risk, prioritise detection of race and temporal memory bugs and invest in robust reproduction pipelines. The broader questions remain: how well these agents generalise to downstream and proprietary kernels, and how to govern capable reproduction tools so they serve safety and security rather than amplifying risk.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Patch-to-PoC: A Systematic Study of Agentic LLM Systems for Linux Kernel N-Day Reproduction

Authors: Juefei Pu, Xingyu Li, Haonan Li, Zhengchuan Liang, Jonathan Cox, Yifan Wu, Kareem Shehada, Arrdya Srivastav, and Zhiyun Qian
Autonomous large language model (LLM) based systems have recently shown promising results across a range of cybersecurity tasks. However, there is no systematic study on their effectiveness in autonomously reproducing Linux kernel vulnerabilities with concrete proofs-of-concept (PoCs). Owing to the size, complexity, and low-level nature of the Linux kernel, such tasks are widely regarded as particularly challenging for current LLM-based approaches. In this paper, we present the first large-scale study of LLM-based Linux kernel vulnerability reproduction. For this purpose, we develop K-Repro, an LLM-based agentic system equipped with controlled code-browsing, virtual machine management, interaction, and debugging capabilities. Using kernel security patches as input, K-Repro automates end-to-end bug reproduction of N-day vulnerabilities in the Linux kernel. On a dataset of 100 real-world exploitable Linux kernel vulnerabilities collected from KernelCTF, our results show that K-Repro can generate PoCs that reproduce over 50% of the cases with practical time and monetary cost. Beyond aggregate success rates, we perform an extensive study of effectiveness, efficiency, stability, and impact factors to explain when agentic reproduction succeeds, where it fails, and which components drive performance. These findings provide actionable guidance for building more reliable autonomous security agents and for assessing real-world N-day risk from both offensive and defensive perspectives.

🔍 ShortSpan Analysis of the Paper

Problem

The paper examines whether autonomous large language model (LLM) based systems can reproduce real Linux kernel vulnerabilities with concrete proofs of concept. It addresses the gap in systematic evaluation of end-to-end vulnerability reproduction in the Linux kernel, a large, low-level code base where bug triggers often require multi-step sequences and precise triggering conditions. The authors introduce a reproducible, automated approach to assess how far agentic LLM systems can operate in this challenging domain and what factors drive success, both for offensive-style vulnerability reproduction and for defensive risk assessment.

Approach

The authors design K-Repro, an autonomous LLM-based agent equipped with a compact toolset for code browsing, virtual machine management, VM interaction and debugging. Given a security patch commit as input, K-Repro builds a vulnerable kernel in a VM, generates a PoC without internet access, and outputs the PoC, a reproduction analysis report and an execution trace. The system integrates four tool types (Code Browsing Tools, VM Management Tools, VM Interaction Tools and VM Debugging Tools) and relies on high-level and technical prompts to guide the agent. Reproduction results are validated by running in QEMU with a deterministic pipeline, using GCC 10, a kernel configuration aligned with Syzbot, and KASAN for memory error detection. The evaluation uses 100 KernelCTF real-world vulnerabilities and compares the LLM-based approach with SyzDirect, a state-of-the-art non-LLM method. Experiments include multiple runs to study effectiveness, efficiency, stability and convergence, and assess robustness under constrained toolsets and weaker input signals. The authors also provide an aligned 85-case subset to compare with SyzDirect under the same settings and kernels.
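
The validation harness itself is not reproduced in the text. A minimal sketch of the kind of deterministic check described might look as follows, booting the pre-patch kernel under QEMU and treating a KASAN report on the serial console as the reproduction signal; the paths, the init= convention and the exact flag set are illustrative assumptions:

```python
import subprocess

def poc_triggers_kasan(kernel_image: str, disk_image: str,
                       timeout_s: int = 300) -> bool:
    """Boot the KASAN-enabled kernel in QEMU, let the guest run the PoC at
    init, and scan the serial console for a KASAN memory-error report."""
    cmd = [
        "qemu-system-x86_64",
        "-m", "2G",
        "-kernel", kernel_image,                   # bzImage built from the vulnerable tree
        "-drive", f"file={disk_image},format=raw",
        "-append", "root=/dev/sda console=ttyS0 init=/run_poc.sh",  # illustrative
        "-nographic",
        "-no-reboot",
    ]
    try:
        console = subprocess.run(cmd, capture_output=True, text=True,
                                 timeout=timeout_s).stdout
    except subprocess.TimeoutExpired as exc:
        console = exc.stdout or ""                 # keep the partial log on timeout
    # KASAN prefixes its reports with "BUG: KASAN:" in the kernel log.
    return "BUG: KASAN:" in console
```

A deterministic pipeline of this shape also makes reruns cheap, which matters for the stability and convergence experiments reported below.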

Key Findings

  • On 100 KernelCTF vulnerabilities the X High model reproduces PoCs in 56.0 percent of cases while the Medium model achieves 47.0 percent.
  • Average monetary cost per case is roughly four US dollars for both models, 4.7 for Medium and 4.0 for X High; average execution times per case are 33.8 minutes for Medium and 19.3 minutes for X High, with notable variation across cases.
  • Maximum per-case costs reach 14.3 dollars for successful Medium cases and 31.3 dollars for failed Medium cases; runtimes peak at 127.6 minutes for successful Medium cases and 299.7 minutes for failed Medium cases, while X High shows maxima of 8.0 and 10.9 dollars and 33.9 and 70.7 minutes for successful and failed cases respectively.
  • Ninety percent of successful X High reproductions finish within 21.6 minutes and all successful X High runs finish within 33.9 minutes, whereas the 90th percentile for Medium is 39.2 minutes, with slower cases reaching 127.6 minutes.
  • Across runs, X High averages 129.5 tool calls per case, including 9.2 VM management actions, 55.5 VM interactions, 40.5 source-browsing calls and 24.4 debugging actions; Medium uses 165.1 calls, with 11.7 VM management actions, 91.7 VM interactions, 38.7 source-browsing calls and 23.0 debugging actions, showing active use of static analysis, dynamic verification and debugging.
  • PoC generation typically requires multiple iterations; X High averages 7.4 PoC generation attempts per case versus 9.1 for Medium, with successful cases requiring far fewer iterations (4.9 and 4.5 respectively).
  • One-shot PoC generation is uncommon, at 10.7 percent for X High and 21.3 percent for Medium; some one-shot successes occur even after the knowledge cutoff date, suggesting that memorised PoCs are not the sole driver of success.
  • Breakpoint-based debugging via GDB is common; breakpoints are hit in most cases, but hits in failed runs show that reachability is not the main bottleneck, and GDB provides limited marginal benefit for reproduction.
  • Race condition vulnerabilities are substantially harder to reproduce, with success rates of roughly one third to two fifths, compared with non-race vulnerabilities; temporal memory bugs such as use-after-free (UAF) and double free (DF) are statistically harder to reproduce even when controlling for race conditions.
  • Temporal memory faults are more difficult than out-of-bounds (OOB) faults after controlling for race conditions, with statistically significant odds ratios indicating lower success for UAF/DF compared with OOB (see the regression sketch after this list).
  • Rich commit messages improve reproducibility; removing commit messages reduces success, especially when the messages contain triggering information, though overall performance remains reasonably strong.
  • The knowledge cutoff date shows no statistically significant effect on reproduction success for either model; pre-cutoff and post-cutoff cases have comparable success rates after controlling for other factors.
  • Convergence occurs across runs; two identical runs yield similar results and a third iteration recovers a small additional set of successes, with diminishing returns thereafter.
  • In a three-way robustness analysis, prompt degradation reduces success slightly and alters tool usage patterns, removal of subsystem-specific utilities has small effects on success but changes PoC structure, and removing GDB has little impact on outcomes, indicating that other factors drive success.
  • On a SyzDirect-aligned subset of 85 cases, K-Repro's X High model reproduces 57 vulnerabilities (67.1 percent), while SyzDirect achieved 42 successful reproductions in its original evaluation; end-to-end reproduction time for X High is about 17.04 minutes on successful cases and 22.48 minutes on average when failed cases are included, compared with SyzDirect, which averages about 11.74 hours per successful case and about 18.85 hours overall when failed cases are included; cost per case for K-Repro is around 2.19 dollars, or 1.58 dollars for successful cases only.
  • Case studies reveal that incorrect root cause analysis can cause failures and that patch-level signals can mislead the agent; in one instance a degraded-prompt X High case succeeds by exploiting crypto backend backlog behaviour when specifically guided to inspect crypto subsystem code, highlighting that targeted analysis of triggering chains can be key to success.
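
The statistical analysis behind the race and temporal-memory findings is not published alongside the paper. As an illustration of how odds ratios "controlling for race conditions" can be estimated, a logistic regression over per-case binary flags would look roughly like the following, with a synthetic data frame standing in for the real per-vulnerability outcomes:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic per-case table: one row per vulnerability.
# success = PoC reproduced; race = race-condition bug; temporal = UAF/DF (vs OOB).
df = pd.DataFrame({
    "success":  [1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
    "race":     [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "temporal": [0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1],
})

# The coefficient on `temporal` estimates its effect on reproduction success
# while holding `race` fixed; exponentiating the coefficients gives odds ratios.
model = smf.logit("success ~ race + temporal", data=df).fit(disp=False)
print(np.exp(model.params))  # odds ratios below 1 mean lower odds of success
```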

Limitations

The study relies on KernelCTF real-world vulnerabilities and a bootable, upstream-based kernel environment, which may not capture all real-world conditions or restrictions present in downstream kernels or proprietary systems. Automatic success is defined by the presence of a PoC and a crash, which, while standard, may not fully capture root cause equivalence in all cases. The evaluation focuses on a subset of subsystems and vulnerabilities and excludes internet access during PoC generation. Some failures arise from environment configuration differences or difficulty in triggering race conditions, and root cause analysis can misinterpret patch signals, especially in complex vulnerabilities. The comparison with SyzDirect uses an aligned subset and kernel configurations that may differ from broader deployment scenarios.

Why It Matters

The work demonstrates that autonomous LLM-based agents can perform kernel-level vulnerability reproduction at non-trivial success rates, within tens of minutes per case and at modest monetary cost, showing feasibility for deep, low-level vulnerability research. It identifies the tool components most associated with success, including code browsing, dynamic verification and debugging, and highlights that race conditions and temporal memory bugs remain challenging. The findings provide actionable guidance for building more reliable autonomous security agents and for assessing real-world N-day risk from offensive and defensive perspectives. The study also raises safety and reliability considerations as such tools become more capable, highlighting the need for careful governance and evaluation in practical deployments.

