
Agents Synthesize CodeQL Queries to Find Vulnerabilities

Agents
Published: Wed, Feb 11, 2026 • By Lydia Stratus
A neuro-symbolic triad uses LLMs to generate CodeQL queries and validate results through semantic review and exploit synthesis. On Python packages it rediscovers historical CVEs with 90.6% accuracy, finds 39 medium-to-high-severity issues in the Top100 including five new CVEs, and reduces noise substantially while keeping runtime and token costs low.

Static application security testing, or SAST, is the backbone of most DevSecOps pipelines, but it carries the same annoying limitations it always has: rules written by humans, plenty of false positives, and blind spots when code does clever things across modules. A recent paper introduces QRS, a practical attempt to push SAST beyond hand-coded patterns with a trio of autonomous agents that generate and validate CodeQL queries. The result is not magic; it is a pragmatic engineering pattern that deserves attention from ops and security teams.

How QRS works

QRS uses three specialised agents named TheQuery, TheReview and TheSanitize. TheQuery synthesises executable CodeQL queries from a compact knowledge schema and a few examples. TheReview traces data flows, reasons about semantics and, when possible, produces minimal proof-of-concept exploit constructs. TheSanitize then prunes noise and produces evidence-based, context-free assessments. The system is model agnostic and designed to run locally with minimal exposure of the codebase to Large Language Models (LLMs). Output is produced as machine-readable artefacts such as JSON and SARIF for integration into existing pipelines.
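
The paper's implementation is not reproduced here, but the control flow it describes is easy to picture. The sketch below is a minimal illustration of that Query, Review, Sanitize hand-off, assuming injected callables for the model and CodeQL steps; none of the names are the authors' actual API.

```python
# Minimal sketch of the Query -> Review -> Sanitize hand-off described above.
# The callables are injected because the paper's internal API is not public;
# every name here is illustrative, not the authors' implementation.
import json
from typing import Callable, Iterable

def qrs_pipeline(
    synthesize_query: Callable[[], str],          # TheQuery: schema + few-shot examples -> CodeQL text
    run_codeql: Callable[[str], Iterable[dict]],  # executes the query, yields raw findings
    review: Callable[[dict], dict],               # TheReview: data-flow trace, optional PoC
    sanitize: Callable[[dict], bool],             # TheSanitize: evidence-based keep/drop decision
) -> str:
    query_text = synthesize_query()
    confirmed = []
    for finding in run_codeql(query_text):
        verdict = review(finding)
        if verdict.get("confirmed"):
            confirmed.append({**finding,
                              "trace": verdict.get("trace"),
                              "poc": verdict.get("poc")})
    kept = [f for f in confirmed if sanitize(f)]
    # The real system also emits SARIF; plain JSON keeps the sketch short.
    return json.dumps(kept, indent=2)
```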

The evaluation is blunt and useful. On a set of 20 historical CVEs in popular PyPI projects QRS detected 19, for a reported detection accuracy of 90.6% across its scan configurations. Scanning the 100 most downloaded PyPI packages turned up 39 medium-to-high-severity issues. Five of those led to new CVE assignments and five produced documentation updates from maintainers; the remaining 29 were concurrently discovered by other researchers, which supports the validity of the findings. The Sanitize agent reduced noise by 64.8% even in the worst case, and the paper reports only small false negative rates across tests. Costs and runtimes are intentionally modest: a 10-configuration scan averaged roughly $5.30 per package in token costs, which is cheaper than many manual audits.

Operational considerations

This is the sort of research I like because it ties model outputs to concrete SRE problems: query artefacts, reproducible evidence, and SARIF that your pipeline can consume. That said, this is not an all-purpose replacement for dynamic testing or human review. The paper explicitly excludes compiled extensions, native code, logic bugs that need runtime context, supply-chain distribution attacks and adversarial evasion. Ground truth for the Top100 scan relies on expert review rather than exhaustive audits, so expect some bias in the precision and recall numbers.

If you want to try this approach in production, treat it as a complement to existing SAST. Run generated queries locally or in a restricted analysis environment, require the Review agent to attach data-flow traces and minimal PoC artefacts before creating high-severity tickets, and log the schema and prompts used for auditability. Governance matters: autonomous query generation and exploit synthesis require controls, especially if you allow models to access internal code. The paper shows promise and practical yields on real packages, but remember the remaining gaps. This advances static analysis. It does not replace on-call experience figuring out why a race condition only crashes at 03:17 on a Sunday.
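
As a concrete example of that gating step, the sketch below filters SARIF results so that only findings carrying a data-flow trace and a PoC reach the ticketing system. It assumes SARIF 2.1.0 output and a `create_ticket` callable you supply; the property name used for PoC material is illustrative, since tools differ in where they attach such evidence.

```python
# Minimal sketch: only let findings with attached evidence become high-severity
# tickets. Assumes SARIF 2.1.0 output from the analysis environment; the
# "properties" key used for PoC material is illustrative, since tools differ in
# where they attach such evidence.
import json
from typing import Callable

def gate_findings(sarif_path: str, create_ticket: Callable[[dict], None]) -> None:
    with open(sarif_path) as fh:
        sarif = json.load(fh)

    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            severe = result.get("level") == "error"            # SARIF severity level
            has_trace = bool(result.get("codeFlows"))          # standard SARIF data-flow field
            has_poc = "poc" in result.get("properties", {})    # illustrative property name

            if severe and has_trace and has_poc:
                create_ticket(result)
            else:
                # Hold for audit instead of paging anyone; keeps noise out of the queue.
                print(f"held: {result.get('ruleId')} (trace={has_trace}, poc={has_poc})")
```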

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

QRS: A Rule-Synthesizing Neuro-Symbolic Triad for Autonomous Vulnerability Discovery

Authors: George Tsigkourakos and Constantinos Patsakis
Static Application Security Testing (SAST) tools are integral to modern DevSecOps pipelines, yet tools like CodeQL, Semgrep, and SonarQube remain fundamentally constrained: they require expert-crafted queries, generate excessive false positives, and detect only predefined vulnerability patterns. Recent work has explored augmenting SAST with Large Language Models (LLMs), but these approaches typically use LLMs to triage existing tool outputs rather than to reason about vulnerability semantics directly. We introduce QRS (Query, Review, Sanitize), a neuro-symbolic framework that inverts this paradigm. Rather than filtering results from static rules, QRS employs three autonomous agents that generate CodeQL queries from a structured schema definition and few-shot examples, then validate findings through semantic reasoning and automated exploit synthesis. This architecture enables QRS to discover vulnerability classes beyond predefined patterns while substantially reducing false positives. We evaluate QRS on full Python packages rather than isolated snippets. In 20 historical CVEs in popular PyPI libraries, QRS achieves 90.6% detection accuracy. Applied to the 100 most-downloaded PyPI packages, QRS identified 39 medium-to-high-severity vulnerabilities, 5 of which were assigned new CVEs, 5 received documentation updates, while the remaining 29 were independently discovered by concurrent researchers, validating both the severity and discoverability of these findings. QRS accomplishes this with low time overhead and manageable token costs, demonstrating that LLM-driven query synthesis and code review can complement manually curated rule sets and uncover vulnerability patterns that evade existing industry tools.

🔍 ShortSpan Analysis of the Paper

Problem

Static Application Security Testing (SAST) tools are central to modern secure software development lifecycles but remain limited by the need for expert-crafted queries, high false positive rates and detection of only predefined vulnerability patterns. Prior work integrating large language models largely uses LLMs to triage existing tool outputs rather than reason about vulnerability semantics directly. The paper introduces QRS, a neuro-symbolic triad that inverts this paradigm by using three autonomous agents to generate CodeQL queries from a structured schema and few-shot examples, then validate findings through semantic reasoning and exploit synthesis, enabling discovery beyond fixed patterns and reducing false positives.

Approach

QRS orchestrates three specialised agents. TheQuery synthesises CodeQL queries from a compact knowledge schema and high-level vulnerability descriptions, producing executable static analysis logic. TheReview performs semantic verification by analysing code context, tracing data flows from source to sink, and generating exploitation suggestions before confirming findings. TheSanitize provides a final context-free, evidence-based assessment to prune noise, assign labels, and deliver high-confidence vulnerability reports. The framework is model agnostic and can operate across the languages CodeQL supports by adapting its predicates; it is designed to run locally with minimal exposure of the codebase to LLMs. It uses retrieval-augmented schema learning, iterative syntax refinement with self-healing, and an optional natural language steering interface to focus on specific CWEs or severities. QRS is evaluated on two real-world Python package datasets, Hist20 and Top100, using expert review for ground truth and a multi-stage vulnerability taxonomy aligned with MITRE CWE. It produces JSON/SARIF artefacts for integration with security workflows and supports exploitation synthesis via minimal PoC constructs. The evaluation compares against traditional tools and assesses detection accuracy, precision, recall and false positive reduction, while varying model families and token budgets to balance coverage and cost.
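
The iterative syntax refinement the authors describe is not specified in detail, but a self-healing loop of this shape is one plausible reading: compile the synthesised query, and feed any compiler diagnostics back to the model for another attempt. The sketch below assumes the standard CodeQL CLI (`codeql query compile`) is on PATH; the `repair` callable stands in for the LLM round trip and is not the paper's API.

```python
# One plausible shape for the "self-healing" syntax refinement: compile the
# synthesised query and feed compiler diagnostics back to the model for another
# pass. Assumes the CodeQL CLI is on PATH; `repair` stands in for the LLM round
# trip and is not the paper's API.
import subprocess
from typing import Callable

def refine_query(ql_path: str, repair: Callable[[str, str], str],
                 max_rounds: int = 3) -> bool:
    for _ in range(max_rounds):
        proc = subprocess.run(
            ["codeql", "query", "compile", ql_path],   # standard CodeQL CLI subcommand
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return True                                # compiles; ready to run against a database
        with open(ql_path) as fh:
            fixed = repair(fh.read(), proc.stderr)     # query text + diagnostics back to the model
        with open(ql_path, "w") as fh:
            fh.write(fixed)
    return False                                       # bounded number of repair rounds exhausted
```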

Key Findings

  • Hist20 evaluation: QRS detected 19 of 20 historical CVEs in popular PyPI libraries, achieving 90.6% detection accuracy with an average configuration consensus of 2.7 out of 4 across four configurations (a minimal consensus-tallying sketch follows this list).
  • Top100 real-world scan: QRS identified 39 medium-to-high-severity vulnerabilities across the 100 most downloaded PyPI packages. Of these, 5 were newly assigned CVEs and 5 were acknowledged via maintainer documentation updates, while the remaining 29 were independently discovered by other researchers, validating severity and discoverability.
  • Novel and known vulnerabilities: Across Top100, QRS rediscovered 29 previously known CVEs and identified 5 novel vulnerabilities in 4 packages, spanning complex patterns such as TOCTOU race conditions, memory exhaustion, and path traversal via dependencies; several CVEs were patched rapidly after disclosure.
  • Exploitation and evidence: PoC exploitation was synthesised for findings where feasible, with evidence-based reasoning, data-flow traces and context supporting verdicts, showing that exploitability can be inferred and demonstrated rather than merely pattern matched.
  • False positive reduction and recall: Even in the worst case, the Sanitize agent reduced noise by 64.8%, while overall recall was high with minimal false negatives (Hist20 2.2% FN; Top100 0.3% FN). The framework delivers a balanced trade-off between precision and recall, prioritising detection of vulnerabilities over minimising false positives.
  • Cost and efficiency: Across Top100, a 10-configuration scan averaged around $5.30 per package, with runtimes and token costs kept manageable, highlighting the cost-effectiveness of end-to-end LLM-guided query synthesis compared with manual audits and traditional tools.
  • Tool comparison: Compared with Opengrep and Bandit, which produced substantially more alerts, and with CodeQL’s standard ruleset, QRS produced fewer but more meaningful findings, including complex cross-module patterns that evade conventional static analysis.
  • Cross-language potential: Although evaluated on Python, the framework is designed to scale to other languages supported by CodeQL by adapting the knowledge predicates, enabling broad applicability.
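
On the configuration consensus figure in the Hist20 bullet above, a minimal way to tally such per-finding agreement across independent scan configurations might look like the sketch below; the (ruleId, file, line) key used to deduplicate findings is an assumption, not the paper's scheme.

```python
# Minimal sketch of per-finding consensus across independent scan configurations,
# in the spirit of the "2.7 of 4" figure above. The (ruleId, file, line) key used
# to deduplicate findings is an assumption, not the paper's scheme.
from collections import Counter
from typing import Iterable

def consensus(config_runs: Iterable[Iterable[dict]]) -> Counter:
    votes: Counter = Counter()
    for findings in config_runs:                       # one element per configuration
        seen = {(f["ruleId"], f["file"], f["line"]) for f in findings}
        votes.update(seen)                             # each configuration votes at most once
    return votes                                       # votes[key] == number of agreeing configs

# A finding reported by 3 of the 4 configurations ends up with a consensus score of 3.
```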

Limitations

The evaluation scope excludes vulnerabilities in compiled extensions or native code, logic vulnerabilities requiring dynamic analysis, supply chain attacks at the package distribution level, and adversarial evasion attempts. Ground truth for Top100 relies on expert review rather than an exhaustive audit, which may bias precision and recall estimates. The current work also does not evaluate runtime or sandboxed dynamic validation of exploits, and there is no end-to-end automatic patching workflow. The approach relies on a knowledge schema and CodeQL predicates, so coverage depends on the breadth of the known CWE space and on CodeQL's capabilities. The architecture is modular, but its real-world performance will depend on model behaviours and prompt engineering, with variations observed across model families.

Why It Matters

QRS demonstrates that large language model guided query synthesis can complement traditional SAST by expanding vulnerability coverage beyond predefined rule sets while reducing false positives. The real-world evaluations on Python packages show that autonomous agents can rediscover known CVEs and uncover new higher-severity issues at a practical cost, suggesting a path towards more proactive, scalable security analytics in open source ecosystems. The work also highlights AI governance implications: autonomous query generation, reasoning and PoC synthesis raise considerations for auditing, reproducibility and safe use within security operations. Overall, QRS provides a compelling proof of concept that LLM-driven query synthesis can enhance reliability and coverage in static analysis, offering a practical route to stronger protection against complex, real-world vulnerabilities.

