
Study exposes DRL pitfalls that compromise security

Defenses
Published: Tue, Feb 10, 2026 • By Dr. Marcus Halden
This survey analyses 66 papers on Deep Reinforcement Learning (DRL) for cybersecurity and identifies 11 recurring methodological pitfalls. It finds an average of 5.8 pitfalls per paper and shows how modelling, evaluation and reporting choices produce brittle or misleading policies. The paper ends with concrete fixes to raise rigour and deployment safety.

This systematisation of knowledge looks at how Deep Reinforcement Learning (DRL) is being applied to security tasks and why many results should be treated with caution. The authors review 66 papers from 2018 to 2025, define 11 recurring methodological pitfalls, and run three controlled experiments in representative environments: autonomous cyber defence, adversarial malware generation, and web security testing.

What they did and how

The team collects literature from major technical archives, filters it to 66 relevant works, and rates each paper for the presence of each pitfall. Two experts review every paper, which gives the assessment more weight than a purely automatic scrape. They group pitfalls into four stages: environment modelling, agent training, performance evaluation, and system deployment. The paper stresses a foundational issue: many security problems are partially observable and non-stationary, which breaks core assumptions of standard Markov decision process models used by DRL.
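
To make the review step concrete, here is a minimal Python sketch of how per-pitfall prevalence and per-paper pitfall counts could be aggregated from reviewer verdicts. The paper identifiers, pitfall labels and verdicts below are invented for illustration, not the authors' actual rating data:

```python
from collections import Counter
from statistics import mean

# Hypothetical ratings: paper -> pitfall -> "present" | "partial" | "absent".
# Labels are illustrative only; the survey rates 66 papers against 11 pitfalls.
ratings = {
    "paper_A": {"policy_convergence": "present",
                "partial_observability": "partial",
                "variance_analysis": "absent"},
    "paper_B": {"policy_convergence": "present",
                "partial_observability": "present",
                "variance_analysis": "present"},
}

def pitfall_count(paper_ratings, count_partial=True):
    """Number of pitfalls affecting one paper; 'partial' counts when count_partial is True."""
    hits = {"present"} | ({"partial"} if count_partial else set())
    return sum(1 for verdict in paper_ratings.values() if verdict in hits)

# Average pitfalls per paper (the survey reports 5.8 across its corpus).
average = mean(pitfall_count(r) for r in ratings.values())

# Prevalence of each pitfall as a fraction of papers where it is fully present.
present_counts = Counter()
for paper_ratings in ratings.values():
    for pitfall, verdict in paper_ratings.items():
        if verdict == "present":
            present_counts[pitfall] += 1
prevalence = {p: n / len(ratings) for p, n in present_counts.items()}

print(f"average pitfalls per paper: {average:.1f}")
print(prevalence)
```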

Key findings

Across the corpus the average paper contains 5.8 of the 11 pitfalls. Policy convergence issues appear in 71.2 per cent of studies. Partial observability is not properly addressed in 60.6 per cent. Variance analysis and reporting are absent in about 66.7 per cent. Roughly 40.9 per cent evaluate in oversimplified environments. Hyperparameter reporting is patchy: 31.8 per cent give no hyperparameters at all, 36.4 per cent report partially, and only 31.8 per cent report them fully. Convergence is frequently undemonstrated: 42.4 per cent of papers show no convergence demonstration, 28.8 per cent show it partially, and 28.8 per cent show it in full.

The controlled experiments make the abstract concrete. In a web testing task, reformulating the decision model raised discovered vulnerabilities by about 10.1 per cent and lifted the lower-performance bound by 17.3 per cent, showing that an algorithm's apparent gains can come from modelling choices rather than better learning. In the autonomous defence environment, changing action ordering or attacker behaviour revealed brittle policies that underperform when assumptions shift. For adversarial malware, trained policies sometimes only edge out naive baselines, underlining the need to separate environment engineering from genuine algorithmic progress.

These issues are not academic nitpicks. They translate into real risks: reward hacking where an agent exploits a flawed objective, failure to transfer from simulation to live systems, and brittle behaviour when attackers change tactics. The common result is a false sense of security: confident claims that do not survive more realistic tests or deployment.

The paper offers concrete mitigations: model security tasks as partially observable when appropriate, run multiple training seeds and report variance, show convergence curves, provide full hyperparameter lists, use stronger baselines and ablations, and test under adversarial and non-stationary conditions. Above all, distinguish algorithmic improvement from environment design.
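
As a rough sketch of the multiple-seeds-and-variance recommendation, the snippet below assumes a placeholder train_agent function standing in for whatever DRL algorithm is actually used; only the reporting and convergence-check logic is the point, and nothing here implements a method from the paper:

```python
import numpy as np

def train_agent(env_name: str, seed: int, episodes: int = 200) -> np.ndarray:
    """Placeholder for a real DRL training loop; returns per-episode returns.
    Here it just produces noisy, slowly improving returns for illustration."""
    rng = np.random.default_rng(seed)
    trend = np.linspace(0.0, 1.0, episodes)
    return trend + rng.normal(0.0, 0.2, episodes)

seeds = [0, 1, 2, 3, 4]                                  # multiple independent runs
curves = np.stack([train_agent("toy-defence-env", s) for s in seeds])

# Report mean and spread of final performance across seeds, not a single run.
final = curves[:, -20:].mean(axis=1)                     # per-seed mean over last 20 episodes
print(f"final return: {final.mean():.3f} +/- {final.std(ddof=1):.3f} (n={len(seeds)})")

# Crude convergence check: the last window should barely differ from the one before it.
prev_window = curves[:, -40:-20].mean()
last_window = curves[:, -20:].mean()
print(f"converged: {abs(last_window - prev_window) < 0.05}")
```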

Operational takeaways

  • Treat DRL prototypes as fragile: require multiple runs with variance reporting and convergence plots before accepting claims.
  • Insist on realistic, partially observable modelling and adversarial test cases to avoid oversold results.
  • Demand full hyperparameters, clear baselines and ablations so you can reproduce and stress-test any deployment.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

SoK: The Pitfalls of Deep Reinforcement Learning for Cybersecurity

Authors: Shae McFadden, Myles Foley, Elizabeth Bates, Ilias Tsingenopoulos, Sanyam Vyas, Vasilios Mavroudis, Chris Hicks, and Fabio Pierazzi
Deep Reinforcement Learning (DRL) has achieved remarkable success in domains requiring sequential decision-making, motivating its application to cybersecurity problems. However, transitioning DRL from laboratory simulations to bespoke cyber environments can introduce numerous issues. This is further exacerbated by the often adversarial, non-stationary, and partially-observable nature of most cybersecurity tasks. In this paper, we identify and systematize 11 methodological pitfalls that frequently occur in DRL for cybersecurity (DRL4Sec) literature across the stages of environment modeling, agent training, performance evaluation, and system deployment. By analyzing 66 significant DRL4Sec papers (2018-2025), we quantify the prevalence of each pitfall and find an average of over five pitfalls per paper. We demonstrate the practical impact of these pitfalls using controlled experiments in (i) autonomous cyber defense, (ii) adversarial malware creation, and (iii) web security testing environments. Finally, we provide actionable recommendations for each pitfall to support the development of more rigorous and deployable DRL-based security systems.

🔍 ShortSpan Analysis of the Paper

Problem

The article surveys the field of Deep Reinforcement Learning for cybersecurity (DRL4Sec) and identifies 11 recurring methodological pitfalls that arise when DRL is applied to security tasks. It argues that the adversarial, non-stationary and partially observable nature of most cyber problems violates core DRL assumptions, risking brittle policies and deployments that offer a false sense of security. The work quantifies how these pitfalls appear across 66 DRL4Sec papers published between 2018 and 2025, finds an average of more than five pitfalls per paper, and demonstrates their practical impact through controlled experiments in three representative domains: autonomous cyber defence, adversarial malware creation, and web security testing. It then presents actionable recommendations to improve the rigour and deployability of DRL-based security systems, with the aim of supporting risk assessment, robust testing and better threat modelling for AI-powered security solutions.

Approach

The authors perform a two-round literature collection using IEEE Xplore, the ACM Digital Library and Google Scholar, resulting in 66 papers after filtering for language and length. They define 11 pitfalls and categorise them across four stages of development: environment modelling, agent training, performance evaluation and system deployment. Each paper is reviewed by two experts and rated as present, partially present or not present for each pitfall, with substantial inter-reviewer agreement. In addition to the review, the authors conduct controlled experiments in three DRL4Sec domains using existing environments: autonomous cyber defence with MiniCAGE, adversarial malware generation with AutoRobust, and web security testing with the Link XSS and Sqirl environments. They discuss foundational issues such as the MDP formulation and the Markov property, emphasising that security tasks are often partially observable and may require POMDPs or alternative modelling. The study also provides concrete recommendations for each pitfall and for reporting, reproducibility and deployment to advance the comparability and realism of DRL-based security research.
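
To illustrate the partial-observability point, here is a sketch of a Gymnasium-style observation wrapper that hides part of the true state from the agent, turning a fully observed environment into the POMDP-like setting most security tasks resemble. The environment name and visible dimensions are invented, and the wrapper assumes a Box observation space:

```python
import numpy as np
import gymnasium as gym

class PartialObservation(gym.ObservationWrapper):
    """Expose only the components a defender could actually see (e.g. host-level
    alerts), hiding ground-truth attacker state carried in the full observation."""

    def __init__(self, env: gym.Env, visible_dims: list[int]):
        super().__init__(env)
        self.visible_dims = visible_dims
        # Assumes the wrapped environment uses a Box observation space.
        low = env.observation_space.low[visible_dims]
        high = env.observation_space.high[visible_dims]
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        # The agent never sees the hidden components of the underlying state.
        return obs[self.visible_dims].astype(np.float32)

# Hypothetical usage with an invented fully observed defence environment:
# env = PartialObservation(gym.make("FullStateDefence-v0"), visible_dims=[0, 1, 4, 5])
```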

Key Findings

  • The study identifies 11 recurring pitfalls for DRL in cybersecurity, organised across four development stages: environment modelling, training, evaluation and deployment.
  • Across 66 papers from 2018 to 2025, the average number of pitfalls per paper is 5.8 and every paper contains at least two pitfalls.
  • Policy convergence is the most common pitfall, present in 71.2 per cent of papers; the underlying assumptions pitfall appears in 28.8 per cent.
  • Partial observability is not adequately addressed in 60.6 per cent of studies; variance analysis is neglected in 66.7 per cent; and 40.9 per cent evaluate in oversimplified environments.
  • Hyperparameter reporting is frequently incomplete; 31.8 per cent provide no hyperparameters, 36.4 per cent report partially, and 31.8 per cent report fully.
  • Variance reporting is generally lacking; 57.6 per cent of papers do not discuss multiple training runs or variance; 33.3 per cent report multiple runs with variance analysis; 9.1 per cent partially discuss.
  • Convergence is frequently undemonstrated; 42.4 per cent provide no convergence demonstration, 28.8 per cent partially demonstrate, and 28.8 per cent demonstrate convergence.
  • Experiments show that pitfalls can degrade performance or produce misleading results across domains. For example, in Link XSS payload generation, reformulating the MDP to use distinct states increased vulnerability discovery by about 10.1 per cent and raised the lower bound by about 17.3 per cent, illustrating how modelling choices directly affect outcomes.
  • In the autonomous cyber defence case MiniCAGE, altering action order or attacker strategies reveals brittleness and sensitivity to deployment assumptions, indicating that policies trained under fixed assumptions may underperform when conditions shift.
  • In adversarial malware generation, comparing settings with and without observability shows that even when trained policies post gains, random baselines can be competitive, emphasising the need to disentangle policy learning from environment design (a toy sketch of this kind of stress test follows this list).
  • In web security testing, experiments show that performance gains can collapse when evaluating on more realistic or sanitized vulnerability sets, highlighting the danger of over optimistic results in simplified environments.
  • The paper emphasises that many studies may attribute to algorithmic advances improvements that actually stem from environment design, leading to false confidence and limited transfer to real-world deployments.
  • Across domains the work documents several real-world risks including reward hacking, sim-to-real transfer failures, and brittle policies under non-stationary conditions. It also highlights the need for stronger baselines, ablations, and transparent reporting to validate DRL-based security claims.
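
The toy sketch referenced in the list above illustrates the kind of stress test these findings motivate. It uses a deliberately simplified stand-in environment, not the paper's MiniCAGE or AutoRobust setups: a "trained" policy that always defends the host the training-time attacker favours is compared with a random baseline once the attacker shifts tactics.

```python
import numpy as np

class ToyDefenceEnv:
    """Tiny stand-in: each step the defender guesses which of four hosts the
    attacker hits next; the attacker follows a fixed probability distribution."""

    def __init__(self, attack_probs, horizon=20, seed=0):
        self.attack_probs = np.asarray(attack_probs, dtype=float)
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        return 0  # trivial observation

    def step(self, action):
        target = self.rng.choice(len(self.attack_probs), p=self.attack_probs)
        reward = 1.0 if action == target else 0.0
        self.t += 1
        return 0, reward, self.t >= self.horizon

def evaluate(policy, env, episodes=200):
    """Mean and standard deviation of episode return for a policy."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        returns.append(total)
    return np.mean(returns), np.std(returns, ddof=1)

n_hosts = 4
train_env = ToyDefenceEnv(attack_probs=[0.7, 0.1, 0.1, 0.1])   # training-time attacker
shift_env = ToyDefenceEnv(attack_probs=[0.1, 0.1, 0.1, 0.7])   # attacker changes tactics

trained_policy = lambda obs: 0                                  # "learned" to defend host 0
baseline_rng = np.random.default_rng(1)
random_policy = lambda obs: baseline_rng.integers(n_hosts)      # naive baseline

for env_name, env in [("training attacker", train_env), ("shifted attacker", shift_env)]:
    for label, policy in [("trained", trained_policy), ("random", random_policy)]:
        m, s = evaluate(policy, env)
        print(f"{env_name:17s} {label:7s} return {m:5.2f} +/- {s:.2f}")
```

Under the training-time attacker the fixed policy looks strong; under the shifted attacker it falls below the random baseline, which is exactly the brittleness pattern the paper warns about.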

Limitations

The review covers 66 papers up to 2025 and relies on expert classifications, which, while achieving substantial agreement, are subject to interpretation. The experimental demonstrations use a subset of available DRL4Sec environments, which may not capture the full diversity of real world cyber security tasks. Findings and recommendations are based on published work and may be affected by publication bias and the evolving nature of the field.

Why It Matters

This SoK clarifies the methodological traps that can undermine the reliability and deployment readiness of DRL-driven security solutions. By outlining a structured pitfall taxonomy, documenting their prevalence, and providing concrete mitigations across modelling, training, evaluation and deployment, the work offers practical guidance to raise the scientific and operational standards of AI-enabled security tools. The emphasis on realistic environments, robust evaluation, convergence evidence and thorough reporting supports safer risk assessment, more trustworthy threat models and better infrastructure for safeguarding AI-powered security deployments.

