Teach LLMs Security Specs to Find Bugs
Pentesting
Large Language Models (LLMs) are very good at explaining code but still stumble when asked to spot real vulnerabilities. The paper examined here argues the reason is simple: LLMs rarely see explicit security specifications during training. They learn surface patterns instead of the rules that make code safe. That matters because a vulnerability is really a mismatch between what code does and what it should do.
What they did
The authors build VulInstruct, a specification-guided approach backed by a knowledge base that teaches an LLM what safe behaviour looks like. One stream extracts general specifications from high-quality patches across projects, restating the implied safe behaviours around functions, types and imports. The other stream gathers domain-specific evidence from past CVEs to capture repeated failure modes in similar repositories. At query time VulInstruct retrieves relevant specs and past cases, scores their relevance, generates focused domain specifications, and asks the LLM to reason about the target function against those specs.
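The query-time flow can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the paper's implementation: `Spec`, `detect`, and the prompt wording are hypothetical, and `embed` and `llm` stand in for whatever embedding model and LLM the pipeline uses.

```python
import math
from dataclasses import dataclass

@dataclass
class Spec:
    text: str    # restated safe-behaviour rule
    source: str  # provenance, e.g. the patch or CVE it came from

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def detect(target_fn, general_specs, cve_cases, embed, llm):
    """Retrieve specs, score their relevance, then reason about the target."""
    # 1. Dual-path retrieval: nearest general specs plus nearest past CVE cases.
    q = embed(target_fn)
    candidates = sorted(general_specs + cve_cases,
                        key=lambda s: cosine(q, embed(s.text)), reverse=True)[:8]
    # 2. Knowledge scoring: keep only specs the model judges relevant.
    relevant = [s for s in candidates
                if "yes" in llm(f"Relevant to this code? Spec: {s.text}\n"
                                f"Code: {target_fn}").lower()]
    # 3. Structured reasoning: a decision grounded in the surviving specs.
    rules = "\n".join(f"- {s.text} ({s.source})" for s in relevant)
    return llm(f"Check the function against these rules:\n{rules}\n\n"
               f"Code:\n{target_fn}\nAnswer VULNERABLE or SAFE, with reasons.")
```

The point of the structure is that the model never free-associates: every decision is anchored to a named spec with a provenance string attached.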
The evaluation uses PrimeVul with strict temporal controls to avoid leakage and compares against prompting, fine-tuning and retrieval baselines. The gains are concrete: VulInstruct achieves a 45.0 percent F1 score, a 32.7 percent relative improvement over the strongest baseline, and 37.7 percent recall, a 50.8 percent improvement. It also uniquely detects 24.3 percent of vulnerabilities, 2.4 times more than any baseline in the study. Pairwise discrimination improves as well. Beyond benchmarks the system discovered a previously unknown high-severity vulnerability assigned CVE-2025-56538, demonstrating real-world payoff.
Why it matters and where to be careful
The idea is straightforward and useful. Teach models the rules of safe behaviour, not just examples of broken code, and they reason better. For practical defenders that means higher recall, clearer explanations of root causes, and fewer misses when code has subtle logic errors rather than obvious pattern matches.
That said, the approach is not magic. Its effectiveness rests on the quality, freshness and provenance of the specification corpus. Spec drift, stale examples, or deliberate poisoning of the spec base can mislead the model. The authors note reliance on an LLM as a judge in parts of their evaluation, which can inject bias. There is also a maintenance cost: assembling high-quality patch-derived specs and keeping domain evidence current is labour intensive.
Still, the results are actionable. If you run security reviews or triage tool output, consider two steps: first, collect a small, well curated corpus of patch-derived specifications from your own codebase and feed them into any retrieval-augmented pipeline so the model reasons from your expected behaviours. Second, treat specification provenance as a metric: tag each spec with source, timestamp and confidence, and automate regular revalidation to guard against spec drift or manipulation. Those two moves give you better alerts and a defence that an LLM cannot easily game.
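The second step, provenance as a metric, amounts to a small amount of bookkeeping. A minimal sketch follows; the record fields, thresholds and function names here are my assumptions, not a standard schema from the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class SpecRecord:
    rule: str                # the safe-behaviour statement itself
    source: str              # e.g. the patch commit or CVE it was derived from
    extracted_at: datetime   # when the spec was extracted
    confidence: float        # curator-assigned trust, 0.0..1.0

def needs_revalidation(spec: SpecRecord, max_age_days: int = 90,
                       min_confidence: float = 0.5) -> bool:
    """Flag stale or weakly trusted specs for review, guarding against drift."""
    age = datetime.now(timezone.utc) - spec.extracted_at
    return age > timedelta(days=max_age_days) or spec.confidence < min_confidence
```

Running a check like this on a schedule, and quarantining anything it flags, is what turns provenance from a label into a defence.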
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Specification-Guided Vulnerability Detection with Large Language Models
🔍 ShortSpan Analysis of the Paper
Problem
Large language models have shown strong performance on code understanding but struggle with vulnerability detection, particularly in distinguishing vulnerable code from patched code. The authors argue that a key reason is the lack of explicit security specifications, the expectations about how code should behave safely, which are rarely captured in training data. When code behaviour deviates from these specifications a vulnerability may arise, yet such knowledge is often implicit rather than documented. The paper introduces VulInstruct, a specification-guided approach that extracts reusable security specifications from historical vulnerabilities to instruct detection of new ones. It builds a specification knowledge base from two perspectives: general specifications derived from high-quality patches across projects, and domain-specific specifications derived from recurrent violations in relevant repositories. This enables LLMs to reason about expected safe behaviours rather than relying on surface patterns. The work aims to improve vulnerability detection recall and precision and to support real-world vulnerability discovery in software projects.
Approach
VulInstruct constructs a specification knowledge base (SKB) with two complementary pipelines. General specifications are extracted from high-quality patch datasets by comparing vulnerable code with its fix and restating the implied safe behaviours, together with surrounding context such as callee functions, type declarations, imports and global variables. Domain-specific specifications are derived from a large CVE database, capturing common exploitation mechanisms within the same domain to reflect the attacker's perspective. The architecture combines offline knowledge construction with online specification-guided detection. Online, VulInstruct performs dual-path retrieval: spec-level retrieval over general specifications and case-level retrieval over domain evidence to assemble relevant knowledge. It then applies a three-stage process: knowledge scoring, which rates retrieved specifications; generation of detailed domain-specific specifications from high-relevance CVE cases; and structured reasoning, in which the target function is analysed against the scored specifications to produce a vulnerability decision with explanations. The SKB stores general specifications, detailed vulnerability cases and a domain evidence base. Retrieval uses embeddings to fetch top candidates, supported by identifiers and domain keywords that prioritise domain-relevant CVEs. The evaluation uses PrimeVul as the benchmark and CORRECT as a knowledge source to build the SKB, with strict temporal constraints to prevent data leakage. The authors compare against prompting, fine-tuning and retrieval-augmented baselines, reporting substantial improvements.
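The temporal constraint is worth making concrete, since it is what keeps the evaluation honest. A hedged sketch of leakage-free retrieval follows; the tuple layout and function name are illustrative assumptions, not the paper's code.

```python
import math
from datetime import date

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, entries, target_date, top_k=5):
    """entries: (embedding, published_date, payload) triples.
    Only knowledge published strictly before the target date is eligible,
    so nothing derived from the vulnerability under test can leak in."""
    eligible = [e for e in entries if e[1] < target_date]  # temporal constraint
    eligible.sort(key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [payload for _, _, payload in eligible[:top_k]]
```

Filtering before ranking, rather than after, also means the top-k slots are never wasted on entries that would have to be discarded anyway.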
Key Findings
- On PrimeVul VulInstruct achieves 45.0 per cent F1 score, a 32.7 per cent relative improvement over the strongest baseline, and 37.7 per cent recall, a 50.8 per cent improvement, while uniquely detecting 24.3 per cent of identified vulnerabilities, 2.4 times more than any baseline.
- In pairwise evaluation VulInstruct attains 17.2 per cent P-C accuracy, a 32.3 per cent relative improvement over the best baseline, indicating improved discrimination between vulnerable and patched code.
- Beyond benchmarks, VulInstruct identified a previously unknown high-severity vulnerability in production code later assigned CVE-2025-56538, demonstrating practical real-world value for vulnerability discovery.
- Ablation analyses show the two specification streams, general and domain-specific, are complementary: removing domain-specific specifications reduces F1 from 45.0 to 40.4 per cent; removing general specifications and detailed vulnerability cases reduces it to 41.0 per cent. The VP-S metric declines more sharply without domain-specific specifications, highlighting their importance for distinguishing vulnerable from patched code.
- Cross-model evaluation shows VulInstruct improves performance across diverse language models, with notable gains on GPT-based and Claude-based systems. A representative setup using DeepSeek-R1 with VulInstruct achieves the best P-C accuracy of 22.8 per cent, accuracy of 56.6 per cent and precision of 62.8 per cent, establishing a new state of the art in this setting.
- Case studies illustrate how retrieved specifications guide reasoning, for example in a buffer overflow decoupled from input validation where domain and general specifications helped to identify and explain the root cause and the fix.
Limitations
Effectiveness depends on the quality and freshness of the specification knowledge base; spec drift or adversarial manipulation of the specification corpus could degrade performance. The approach relies on strict provenance and ongoing updates to maintain relevance. The CORRECT evaluation framework employs an LLM as judge which, while powerful, may introduce biases. The authors acknowledge limitations in current contextual datasets and the need for richer security oriented data sources and broader evaluation across real world artefacts. Temporal constraints are essential to prevent leakage but may constrain knowledge retrieval in fast moving domains.
Why It Matters
The specification guided paradigm offers a principled way to teach LLMs about security rather than relying on surface pattern matching. By retrieving and applying explicit security specifications derived from patches and CVEs, VulInstruct improves vulnerability detection accuracy and recall, enhances the ability to distinguish vulnerable from patched code, and demonstrates real world applicability by uncovering a high severity CVE in production software. These results suggest practical benefits for automated security reviews and vulnerability discovery in software projects, particularly for AI assisted code analysis and secure coding audits. The approach also highlights potential risks related to knowledge base quality and the need for robust provenance and updating protocols to prevent spec drift or manipulation.