
Anchor LLMs with ATT&CK, Cut Pentest Hallucinations

Pentesting
Published: Wed, Sep 10, 2025 • By Theo Solander
New research shows constraining LLM-driven penetration testing to a fixed MITRE ATT&CK task tree dramatically cuts hallucinations and redundant queries while raising task completion rates across models. The method speeds automated assessments, helps smaller models succeed, and warns defenders to update mappings before attackers and tools weaponize the same guided approach.

New research shows that tying LLM-driven penetration testing to a deterministic MITRE ATT&CK task tree tames hallucinations, reduces redundant queries, and raises task completion across models. In HackTheBox exercises the guided pipeline completed roughly 72 to 79 percent of subtasks versus single-digit completion for a self-guided baseline on smaller models, while cutting queries by about half.

History likes this trick. After early waves of automation produced noisy, unreliable outputs, engineers reclaimed progress by grafting structure onto models: rulebooks, checklists, and standards. The study repeats that arc for security tooling. By anchoring reasoning to proven tactics and an externally maintained task tree, the agents avoid circular loops and provide auditable progression.

Why this matters now: defenders gain faster, cheaper and more consistent automated assessments and can make smaller, cheaper models useful in real workflows. The concerning flip side is that attackers can borrow the same pattern to automate more efficient attacks. Also, a static mapping is brittle; if the ATT&CK mapping lags, the agent misses modern techniques or follows stale paths.

Practical takeaway for teams: treat the task tree as infrastructure. Maintain and version the ATT&CK mappings, add CVE and telemetry integrations, log queries and confidence, and require human sign-off for intrusive steps. Regularly red-team your automated tester and measure hallucination and query metrics. Structure gives you speed and repeatability, but only active stewardship keeps it safe.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Guided Reasoning in LLM-Driven Penetration Testing Using Structured Attack Trees

Authors: Katsuaki Nakano, Reza Feyyazi, Shanchieh Jay Yang, and Michael Zuzak
Recent advances in Large Language Models (LLMs) have driven interest in automating cybersecurity penetration testing workflows, offering the promise of faster and more consistent vulnerability assessment for enterprise systems. Existing LLM agents for penetration testing primarily rely on self-guided reasoning, which can produce inaccurate or hallucinated procedural steps. As a result, the LLM agent may undertake unproductive actions, such as exploiting unused software libraries or generating cyclical responses that repeat prior tactics. In this work, we propose a guided reasoning pipeline for penetration testing LLM agents that incorporates a deterministic task tree built from the MITRE ATT&CK Matrix, a proven penetration testing kill chain, to constrain the LLM's reasoning process to explicitly defined tactics, techniques, and procedures. This anchors reasoning in proven penetration testing methodologies and filters out ineffective actions by guiding the agent towards more productive attack procedures. To evaluate our approach, we built an automated penetration testing LLM agent using three LLMs (Llama-3-8B, Gemini-1.5, and GPT-4) and applied it to navigate 10 HackTheBox cybersecurity exercises with 103 discrete subtasks representing real-world cyberattack scenarios. Our proposed reasoning pipeline guided the LLM agent through 71.8%, 72.8%, and 78.6% of subtasks using Llama-3-8B, Gemini-1.5, and GPT-4, respectively. Comparatively, the state-of-the-art LLM penetration testing tool using self-guided reasoning completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries. This suggests that incorporating a deterministic task tree into LLM reasoning pipelines can enhance the accuracy and efficiency of automated cybersecurity assessments.

🔍 ShortSpan Analysis of the Paper

Problem

Large Language Models hold promise for automating penetration testing but rely on self-guided reasoning that often produces inaccurate or hallucinated procedural steps. This can lead to wasted effort or circular reasoning that repeats prior tactics. The paper studies whether constraining the LLM to a deterministic, MITRE ATT&CK-based task tree can improve reliability and efficiency in automated vulnerability assessment.

Approach

A structured reasoning pipeline uses a Structured Task Tree (STT) built from MITRE ATT&CK to bound the LLM's reasoning to explicit tactics, techniques and procedures. The STT contains 30 techniques drawn from the MITRE ATT&CK Matrix, omitting post-exploitation, blue-team and some other tactics. Each task node has four fields: description, completion status, findings, and next possible tasks. The pipeline operates in four stages: Task Initialization loads the initial task from the STT and prompts the LLM to generate commands; Output Summarization captures findings and decides whether to mark the task complete; Task Selection chooses the next task from the STT based on findings; and Command Generation produces executable commands for the tester. If a command is invalid, the agent marks the task as in progress and generates a new command, failing the task after five invalid commands. The STT is maintained externally by code, not regenerated by the LLM, which aids consistency and efficiency. The evaluation compares this STT-based method with a baseline that uses a self-guided Pentesting Tree (PTT), in which the LLM freely generates and updates the tasks. Three LLMs (Llama-3-8B, Gemini-1.5 and GPT-4) are evaluated across ten HackTheBox machines comprising 103 subtasks, with a mix of easy, medium and hard targets and an even Windows/Linux split. The study measures subtask completion and the number of LLM queries.
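The task-node shape and the retry rule described above can be sketched as follows. This is a reading of the paper's description, not its code: the field names, callback signatures, and return conventions are assumptions, with the LLM calls stubbed out as plain callables.

```python
from dataclasses import dataclass, field

MAX_INVALID_COMMANDS = 5  # per the paper: a task fails after five invalid commands

@dataclass
class TaskNode:
    """One node of the Structured Task Tree (STT); field names are illustrative."""
    description: str
    completed: bool = False
    findings: list[str] = field(default_factory=list)
    next_tasks: list["TaskNode"] = field(default_factory=list)

def run_task(node, generate_command, execute, summarize):
    """One pass through a single task: generate, execute, summarize, or fail."""
    invalid = 0
    while invalid < MAX_INVALID_COMMANDS:
        cmd = generate_command(node)       # Command Generation (LLM call)
        ok, output = execute(cmd)          # run the command against the target
        if not ok:
            invalid += 1                   # invalid command: task stays in progress, retry
            continue
        finding, done = summarize(output)  # Output Summarization (LLM call)
        node.findings.append(finding)
        node.completed = done
        return node
    return node                            # failed after five invalid commands
```

Note that the tree itself is never handed to the model for rewriting: Task Selection would walk `next_tasks` in external code, which is what keeps the agent's progression deterministic and auditable.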

Key Findings

  • Guided STT-based reasoning completed 71.8 per cent of subtasks with Llama-3-8B, 72.8 per cent with Gemini-1.5, and 78.6 per cent with GPT-4, versus 13.5 per cent, 16.5 per cent and 75.7 per cent for the self-guided baseline, respectively.
  • The STT-based approach reduced model queries by around 55.9 per cent on average compared with the baseline, and it enabled the smaller models to complete four machines each, whereas the baseline completed no machines with those models.
  • GPT-4 achieved machine completion similar to the baseline but with higher subtask coverage and improved efficiency; the framework consistently reduced circular reasoning and hallucinations compared with self-guided reasoning, and it maintained progression tracking through the externally managed STT.
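The two query figures reported in the paper are consistent with each other, which is worth checking: if the baseline issues (1 + x) queries for every guided query, the guided pipeline's saving is x / (1 + x). Averaging that over the three per-model overheads recovers the reported ~55.9 per cent reduction.

```python
# Baseline query overhead versus the guided pipeline, per the paper:
# Llama-3-8B +86.2%, Gemini-1.5 +118.7%, GPT-4 +205.9%.
extra = [0.862, 1.187, 2.059]

# If the baseline uses (1 + x) queries per guided query,
# the guided pipeline saves x / (1 + x) of the baseline's queries.
savings = [x / (1 + x) for x in extra]
avg_saving = sum(savings) / len(savings)
print(f"{avg_saving:.1%}")  # ≈ 56.0%, matching the reported ~55.9% average
```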

Limitations

Limitations include the absence of web search for CVEs, restricted handling of exploitation tool capabilities, and difficulty exploiting complex CVEs such as MS14-068 on hard targets. A static knowledge base is also a practical limitation in real-world environments, and because the STT excludes certain techniques it cannot cover all possible attack methods. Future work may include retrieving CVEs, integrating multimodal reasoning for tool interfaces, and dynamically updating the ATT&CK mapping to stay current.

Why It Matters

Constraining LLM-driven penetration testing to a deterministic, MITRE ATT&CK-based task tree can tame hallucinations, improve consistency and reduce the number of model queries required, making automated assessments faster and more reliable. The approach offers a transferable framework for other structured domains and highlights security implications if attackers adopt similar guided approaches. It also has governance implications for how automated assessments are used in practice and how workforce impacts are managed. The authors note that the method enhances the reliability and efficiency of AI-assisted security work, and they provide open-source code in a public repository for reuse and extension.

