Memory aids RL pen-testing robustness and transfer
Pentesting
Penetration testing becomes a sequential decision problem as an agent probes hosts, scans services and tries exploits. The catch is partial observability: the agent never sees the full state, so the Markov assumption that underpins many RL models breaks down. That matters because memory, or a believable proxy for it, changes what automated attackers learn to do.
What the researchers did
The authors adapt the existing NASim simulator into StochNASim, a partially observable, stochastic testbed that generates a new network each episode. The environment supports up to eight hosts and seven action types: Exploit, Privilege Escalation, Service Scan, Process Scan, OS Scan, Subnet Scan and No Op. Actions have costs and probabilistic outcomes. The paper compares a vanilla Proximal Policy Optimisation (PPO) baseline with several PPO variants designed to mitigate partial observability: frame stacking, observation augmentation with recent history, an LSTM recurrent policy and a transformer policy (TrXL).
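The observation-augmentation idea is simple enough to sketch: instead of a recurrent memory, the policy input is the current observation concatenated with the last few observation and action pairs. The wrapper below is a minimal illustration, assuming a Gymnasium-style environment with a flat observation space and a discrete action space; the class and parameter names such as HistoryAugmentation and history_len are ours, not the paper's.

```python
# Minimal sketch of observation augmentation (not the authors' code):
# the policy sees the current observation plus the last k (observation, action) pairs.
from collections import deque

import gymnasium as gym
import numpy as np


class HistoryAugmentation(gym.Wrapper):
    """Concatenate the current observation with the last k (obs, action) pairs."""

    def __init__(self, env: gym.Env, history_len: int = 4):
        super().__init__(env)
        self.history_len = history_len
        self.obs_dim = int(np.prod(env.observation_space.shape))
        self.n_actions = env.action_space.n
        # Each history slot stores one observation plus a one-hot encoded action.
        self.slot_dim = self.obs_dim + self.n_actions
        self.history = deque(maxlen=history_len)
        dim = self.obs_dim + history_len * self.slot_dim
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(dim,), dtype=np.float32
        )

    def _augment(self, obs):
        obs = np.asarray(obs, dtype=np.float32).ravel()
        # Zero-pad until the history buffer has filled up.
        pad = [np.zeros(self.slot_dim, dtype=np.float32)] * (self.history_len - len(self.history))
        return np.concatenate([obs, *pad, *self.history])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.history.clear()
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        one_hot = np.zeros(self.n_actions, dtype=np.float32)
        one_hot[action] = 1.0
        # Record the action taken and the observation it produced.
        self.history.append(np.concatenate([np.asarray(obs, dtype=np.float32).ravel(), one_hot]))
        return self._augment(obs), reward, terminated, truncated, info
```

The augmented observation keeps the policy itself memoryless, which is why training remains as cheap as plain PPO while still exposing recent history to the network.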
Across networks of varying size the clear winner is the simple observation augmentation approach. Feeding recent history into the policy yields higher cumulative reward and converges about three times faster than the more complex recurrent and transformer architectures. Frame stacking and the LSTM variant land in the middle or perform worse. The transformer tends toward a brute-force tactic of sequentially firing exploits without much scanning. Manual inspection of behaviours supports the quantitative results: the observation-augmented policies scan more, use information efficiently and transfer better when deployed on unseen network configurations.
The authors also show that policies trained on a fixed, deterministic benchmark transfer poorly to new networks, while policies trained on the stochastic environment generalise more reliably. That is a small but important lesson: diversity in training environments can expose brittle assumptions and produce more robust strategies.
Why defenders should care
The study is a reminder that memory matters. An attacker agent that aggregates history or uses a simple, hand-crafted belief feature set can learn efficient probing strategies faster than one that relies on complex recurrent learning. From a blue-team perspective that suggests two blind spots. First, telemetry that presents only snapshots invites strategies that stitch those snapshots together. Second, detection models that ignore history or temporal patterns will miss agents that perform measured scans and staged escalation across many steps.
There are obvious limits. This work runs in simulation, the diversity of operating systems and services is constrained, and dynamic changes during an episode are not modelled. The handcrafted observation augmentation the paper favours could be brittle in real networks. Transformers and recurrent models also come with higher computational cost, which the authors note.
Still, the takeaway is practical: improving observability and introducing memory-aware detection makes it harder for automated attackers operating under partial observability to hide. Teams should treat temporal linkage as a first-class signal and validate detection against agents trained in stochastic, variable environments, not only fixed benchmarks.
Actionable checks teams can run
- Instrument logs to join events across scans and privilege changes and evaluate simple history-based detectors (a minimal sketch follows this list).
- Train or test detection rules against agents from stochastic simulators rather than a single static configuration.
- Measure how long it takes current alerts to accumulate sufficient context and close that window by enriching telemetry and correlating across hosts.
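To make the first check concrete, here is an illustrative sketch of a history-based detector that correlates scan and privilege-change events per source host over a sliding window. The event schema, the six-hour window and the thresholds are assumptions for the example, not values from the paper.

```python
# Illustrative history-based detector (not from the paper): flag hosts whose
# recent event sequence looks like staged probing followed by privilege change.
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=6)   # assumed correlation window
SCAN_TYPES = {"service_scan", "process_scan", "os_scan", "subnet_scan"}

def flag_staged_probing(events, min_scans=3, require_priv_change=True):
    """events: iterable of dicts with 'time' (datetime), 'src' (host), 'type' (str)."""
    history = defaultdict(deque)   # source host -> events still inside the window
    flagged = set()
    for event in sorted(events, key=lambda e: e["time"]):
        window = history[event["src"]]
        window.append(event)
        # Drop events that have aged out of the correlation window.
        while window and event["time"] - window[0]["time"] > WINDOW:
            window.popleft()
        scans = sum(1 for e in window if e["type"] in SCAN_TYPES)
        priv = any(e["type"] == "priv_change" for e in window)
        if scans >= min_scans and (priv or not require_priv_change):
            flagged.add(event["src"])
    return flagged

# Example usage with synthetic telemetry:
now = datetime(2024, 1, 1, 12, 0)
sample = [
    {"time": now, "src": "10.0.0.5", "type": "service_scan"},
    {"time": now + timedelta(minutes=40), "src": "10.0.0.5", "type": "os_scan"},
    {"time": now + timedelta(hours=2), "src": "10.0.0.5", "type": "process_scan"},
    {"time": now + timedelta(hours=3), "src": "10.0.0.5", "type": "priv_change"},
]
print(flag_staged_probing(sample))  # {'10.0.0.5'}
```

Even a rule this crude encodes the temporal linkage the paper's agents exploit, which is the point: the baseline to beat is not a single-event signature but any detector that remembers what happened a few steps ago.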
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Learning Robust Penetration-Testing Policies under Partial Observability: A systematic evaluation
🔍 ShortSpan Analysis of the Paper
Problem
Penetration testing is framed as a sequential decision problem suitable for reinforcement learning, but partial observability breaks the Markov property, requiring history aggregation or belief state estimation to learn effective policies. The study investigates stochastic, partially observable penetration testing scenarios across host networks of varying size to better reflect real world complexity. The goal is to develop robust and transferable policies that perform reliably across diverse and unpredictable environments. The work compares several PPO variants designed to mitigate partial observability, including frame stacking, historical augmentation, and recurrent or transformer based architectures, assessing their generalisation across network sizes.
Approach
To address this, the authors adapt NASim into StochNASim, a partially observable and stochastic environment with variable network sizes. Each episode generates a new network with different host properties, and the observation space is expanded to handle networks of up to eight hosts. The environment includes seven action types: Exploit, Privilege Escalation, Service Scan, Process Scan, OS Scan, Subnet Scan and No Op. Each action has a cost and a success probability; transitions depend on reachability and preconditions, and the agent receives an observation of the outcome after each action. The action space scales with the number of hosts; when a network has fewer than the maximum, No Op actions pad the space. The reward function rewards discovery of and access to sensitive hosts and charges for actions. The experiments compare vanilla PPO with variants that address partial observability: PPO with Frame Stacking (PPO FS), Observation Augmentation (PPO AO), LSTM-based PPO, and Transformer-based PPO (TrXL). Hyperparameters are tuned with Optuna over 250 trials per algorithm; evaluation uses 100 episodes per check over multiple seeds, with a 5 million step training budget per run and a Median Pruner. Networks tested range from 5 to 8 hosts; the action space size is 96; exploit and escalation success probability is 0.9; costs are 1 for scans and 3 for exploits and escalations; the step limit is 5000. The NASim baseline uses four fixed network sizes, whereas StochNASim trains on diverse networks to test generalisation. The study also includes a transfer test showing that policies learned in NASim transfer poorly to unseen configurations.
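To make the transition model concrete, the sketch below shows one way such a step function can look: each action carries a cost and a success probability, and an exploit only succeeds when reachability and service preconditions hold. Only the costs (scan 1, exploit 3) and the 0.9 success probability come from the paper; the host model, the reward bonus and all names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a cost-and-probability transition for one action.
import random
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Host:
    reachable: bool = False
    services: set = field(default_factory=set)
    compromised: bool = False

@dataclass
class Action:
    kind: str                      # "exploit", "service_scan", "no_op", ...
    target: Optional[int] = None
    service: Optional[str] = None
    cost: float = 1.0              # scans cost 1; exploits and escalations cost 3
    success_prob: float = 1.0      # exploits and escalations succeed with probability 0.9

def step(hosts, action, rng):
    """Apply one action and return a (partial observation, reward) pair."""
    reward = -action.cost
    obs = {}
    if action.kind == "no_op" or action.target not in hosts:
        return obs, reward                            # padding action: pay the cost, learn nothing
    host = hosts[action.target]
    if action.kind == "service_scan" and host.reachable:
        obs["services"] = set(host.services)          # observe only the scanned host
    elif action.kind == "exploit":
        preconditions_met = host.reachable and action.service in host.services
        if preconditions_met and rng.random() < action.success_prob:
            host.compromised = True
            obs["compromised"] = True
            reward += 10.0                            # placeholder bonus for gaining access
    return obs, reward

# Example: exploiting a reachable host that runs ssh.
rng = random.Random(0)
net = {0: Host(reachable=True, services={"ssh"})}
print(step(net, Action("exploit", target=0, service="ssh", cost=3, success_prob=0.9), rng))
```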
Key Findings
- History matters: the task benefits greatly from history aggregation, with the history-augmented policy converging about three times faster than the other approaches.
- PPO AO, which augments observations with history, consistently achieves the highest performance across network sizes, and converges faster than the other methods whilst yielding larger cumulative rewards.
- Frame stacking (PPO FS) and the LSTM variant deliver intermediate or poorer results; PPO TrXL tends toward a brute-force strategy of trying exploits sequentially, while PPO LSTM struggles to capture history efficiently.
- Across network sizes, seven hosts often produce the highest mean rewards for most algorithms, while eight hosts increase complexity and degrade performance for all but plain PPO.
- Qualitative analyses show that PPO AO spends more time on scans and uses information efficiently; PPO TrXL does little scanning; PPO LSTM makes inconsistent use of its memory; the PPO baseline distributes actions more evenly, indicating policy uncertainty in the absence of memory.
- The study also finds that policies learned in a fixed NASim environment transfer poorly to new networks, whereas those trained in StochNASim generalise across configurations.
Limitations
Limitations include reliance on a simulated environment; dynamic network changes during episodes are not explored; the handcrafted observation augmentation may be brittle; the configured diversity is limited to two operating systems, two services and two processes per category; and the computational cost is nontrivial, especially for the transformer and recurrent models. Validation in real-world networks remains to be demonstrated.
Why It Matters
The findings show that simple observation augmentation of a memoryless policy can outperform more complex memory architectures in partially observable penetration testing, highlighting the role of observability and memory in defending or attacking AI-driven systems. The work provides benchmarking guidance for evaluating the robustness of security systems against partially observed automated threats and emphasises the need for memory-aware defence strategies. It also underscores the societal risk of AI-powered offensive capabilities on critical networks, while offering actionable insights for improving detection and resilience through enhanced observability.