LLMs Map CVEs to Real-World Attacker Techniques
The paper offers a tidy promise: use large language models to turn cryptic CVE text into actionable ATT&CK techniques. TRIAGE combines rule-based methodology mappers with an in-context learning module to produce ranked technique lists, and the hybrid output improves recall for exploitation techniques.
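A minimal sketch of how such a hybrid mapper might be wired together, assuming a hypothetical keyword rule table and an LLM ranking call; none of the names below come from the paper:

```python
# Hypothetical hybrid CVE -> ATT&CK mapper: rule-based candidates merged with
# an LLM-ranked list. Rules and technique IDs are illustrative only.
from typing import Callable

# Toy rule table: keyword in the CVE description -> ATT&CK technique ID.
RULES = {
    "remote code execution": "T1203",   # Exploitation for Client Execution
    "sql injection": "T1190",           # Exploit Public-Facing Application
    "credential": "T1552",              # Unsecured Credentials
}

def rule_based_candidates(description: str) -> list[str]:
    """Return technique IDs whose trigger keywords appear in the description."""
    text = description.lower()
    return [tid for kw, tid in RULES.items() if kw in text]

def hybrid_map(description: str, llm_rank: Callable[[str], list[str]],
               top_k: int = 5) -> list[str]:
    """Merge rule hits (high precision) with LLM-ranked techniques (broader recall)."""
    ranked = rule_based_candidates(description)      # rules first
    for tid in llm_rank(description):                # then LLM suggestions
        if tid not in ranked:
            ranked.append(tid)
    return ranked[:top_k]

# Stub standing in for an in-context-learning call to a hosted model.
def fake_llm_rank(description: str) -> list[str]:
    return ["T1190", "T1059", "T1068"]

print(hybrid_map("Buffer overflow allows remote code execution.", fake_llm_rank))
```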
On practical measures the results are believable but imperfect. In the reported tests GPT-4o-mini beats Llama3.3-70B on both ranking and recall. Costs are modest for research use - roughly US$0.07 and about 50 seconds per CVE with GPT-4o-mini - but latency and provider choice matter once you scale to thousands of new entries a year.
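For planning purposes, a back-of-the-envelope scaling of those per-CVE figures; the annual CVE volume and the assumption of linear, sequential costs are mine, not the paper's:

```python
# Rough scaling estimate from the reported per-CVE cost and latency.
# Assumes costs stay linear and requests run one at a time; a real pipeline
# would batch and parallelise.
cves_per_year = 30_000          # assumed order of magnitude of new CVEs per year
cost_per_cve_usd = 0.07         # reported GPT-4o-mini cost per CVE
seconds_per_cve = 50            # reported GPT-4o-mini latency per CVE

print(f"annual cost ~ ${cves_per_year * cost_per_cve_usd:,.0f}")
print(f"serial time ~ {cves_per_year * seconds_per_cve / 3600:,.0f} hours")
```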
Why this matters: accurate mapping changes prioritization. If a CVE is linked to remote code execution rather than information disclosure, the patch cadence, firewall rules, and incident playbooks all change. Get the mapping wrong and you reroute scarce defenses away from real danger. The paper flags the usual suspects - hallucination, prompt confusion, training data leakage, and sparse labels for secondary impacts - all of which translate into real operational risk.
Case note: an automated pipeline that overcalls impact could push an organization to triage hundreds of low-risk flaws ahead of a few high-risk ones. Adversaries that can influence public descriptions or poison in-context examples could nudge mappings in their favor.
Actionable checks teams can run now:
- Spot-check 50 random CVE mappings against human analysts and track false positives and negatives (see the scoring sketch after this list).
- Compare model suggestions to evidence sources like exploit databases before changing priority.
- Log and rate-limit model inputs to detect data leakage or adversarial prompts.
- Track mapping confidence and flag low-confidence secondary impacts for manual review.
- Benchmark across models and prompt variants and include a human-in-the-loop gate for changes to SLAs.
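A minimal sketch of the spot-check in the first item, assuming you can export model mappings and analyst labels as sets of ATT&CK technique IDs per CVE; the data layout and field names are made up, so adapt them to your own pipeline:

```python
# Spot-check sketch: compare model-assigned ATT&CK techniques against analyst
# labels for a random sample of CVEs and report false positives/negatives.
import random

def spot_check(model_map: dict[str, set[str]],
               analyst_map: dict[str, set[str]],
               sample_size: int = 50, seed: int = 0) -> dict[str, int]:
    cve_ids = random.Random(seed).sample(sorted(analyst_map),
                                         min(sample_size, len(analyst_map)))
    tp = fp = fn = 0
    for cve in cve_ids:
        predicted = model_map.get(cve, set())
        expected = analyst_map[cve]
        tp += len(predicted & expected)
        fp += len(predicted - expected)   # techniques the model invented
        fn += len(expected - predicted)   # techniques the model missed
    return {"sampled": len(cve_ids), "true_pos": tp,
            "false_pos": fp, "false_neg": fn}

model = {"CVE-2024-0001": {"T1190", "T1059"}}
analyst = {"CVE-2024-0001": {"T1190"}}
print(spot_check(model, analyst, sample_size=1))
```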
TRIAGE is useful, not magical. Treat it like an intern you still have to supervise.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies how retrieval-augmented generation (RAG) systems can be universally corrupted by a small set of injected texts so that many diverse user queries produce attacker-chosen outputs. This matters because RAG is widely used in finance, healthcare and security, and prior attacks targeted single or similar queries rather than large, diverse query sets.
Approach
UniC-RAG partitions a large set of target queries using balanced similarity-based clustering, then jointly optimises one adversarial text per cluster. Each adversarial text is decomposed into a retrievability component (P_i^r) and a manipulation component (P_i^g); HotFlip-style gradient optimisation with greedy initialisation crafts P_i^r, while P_i^g is typically a prompt-injection payload. Threat model: the attacker can inject texts and has white‑box access to the retriever, may or may not know the LLM, and cannot change retriever or LLM parameters. Evaluations use Natural Questions, HotpotQA, MS-MARCO and a Wikipedia dump (47,778,385 chunks); four retrievers and seven LLMs, including Llama variants and GPT models, were tested.
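A rough sketch of the first stage as described, assuming query embeddings from the white-box retriever are available. The greedy balanced assignment below is a stand-in for the paper's clustering, not its exact algorithm, and the HotFlip-style optimisation of each adversarial text is only noted in comments:

```python
# Sketch: split a large query set into balanced clusters by embedding
# similarity, so one adversarial text can later be optimised per cluster.
import numpy as np

def balanced_clusters(query_emb: np.ndarray, n_clusters: int) -> list[list[int]]:
    """Greedy balanced assignment: each query joins the most similar centroid
    that still has room. Centroids are initialised from random queries."""
    n = len(query_emb)
    cap = -(-n // n_clusters)                        # ceil(n / n_clusters)
    rng = np.random.default_rng(0)
    centroids = query_emb[rng.choice(n, n_clusters, replace=False)]
    clusters: list[list[int]] = [[] for _ in range(n_clusters)]
    sims = query_emb @ centroids.T                   # cosine sim if rows are unit-norm
    for idx in np.argsort(-sims.max(axis=1)):        # most confident queries first
        for c in np.argsort(-sims[idx]):             # best remaining cluster with room
            if len(clusters[c]) < cap:
                clusters[c].append(int(idx))
                break
    return clusters

queries = np.random.default_rng(1).normal(size=(200, 32))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
groups = balanced_clusters(queries, n_clusters=10)
print([len(g) for g in groups])
# Next stage (not shown): for each cluster, craft one adversarial text whose
# retrievability part is optimised with HotFlip-style gradients and whose
# manipulation part carries the attacker's prompt-injection payload.
```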
Key Findings
- High effectiveness: UniC-RAG achieves over 90% retrieval and attack success rates by injecting 100 adversarial texts to simultaneously attack hundreds to ~2,000 queries.
- Aggregate performance: reported average Retrieval Success Rate 93.2% and Attack Success Rate 81.2% across datasets; outperforms baselines such as PoisonedRAG, Jamming and Corpus Poisoning.
- Robustness of attack goals: can induce malicious links, harmful command execution and denial-of-service; evaluated defences (paraphrasing, expanded context windows, robust RAG variants like InstructRAG) were insufficient (example: InstructRAG DoS RSR 99.6% and ASR 70.4%).
Limitations
Assumes white‑box access to the retriever and the ability to inject texts; black‑box retriever attacks are not evaluated. A trade-off exists between retrievability and manipulation. Evaluations focus on QA-style tasks; generalisation to other RAG applications is noted but not fully measured. Other deployment constraints and detection rates are not reported.
Why It Matters
UniC-RAG demonstrates a scalable, realistic route to large‑scale contamination of RAG systems that can redirect users to harmful sites, trigger dangerous commands or degrade service. Current defences appear inadequate, emphasising the need for stronger data vetting, retrieval hardening and anomaly/adversarial detection to protect AI assistants used in critical domains.