LLM agents automate ROS pentesting with graph memory
Pentesting
Operational technology security folk love to say robots are just computers with arms until those arms start moving. Pentesting Robot Operating System (ROS) and ROS2 networks is messy, context-heavy and easy to derail with naive automation. A new paper tackles this with a multi-agent workflow that keeps the language model on a short leash and writes everything down.
What they built
The authors use a Large Language Model (LLM) planner, an executor and a memory agent, stitched together with LangGraph. The planner breaks work into structured tasks. The executor runs a small, deterministic toolset such as Nmap and bespoke Bash scripts aimed at ROS enumeration and exploitation. The memory agent records what happened into a graph-based store they call GraphMemory. That memory captures hosts, services, communications, vulnerabilities, attempted exploits and task provenance, and it persists across steps so the agents do not forget what they already learned.
To stop the system wandering off, they define operational modes such as scanning and ROS exploitation. A human operator flips the mode switch, which bounds what the executor is allowed to do. This aligns with traceability and human oversight requirements that frameworks like the EU AI Act are pushing towards.
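The mode switch amounts to a whitelist: each mode permits a fixed set of tools, and the executor refuses anything outside it. A toy sketch of that gate, where the mode names and tool names are illustrative, not taken from the paper:

```python
# Hypothetical mode gate: each operational mode whitelists the tools the
# executor may invoke. Switching modes is a human decision, not the LLM's.
ALLOWED_TOOLS = {
    "scanning": {"nmap", "ping"},
    "ros_exploitation": {"rostopic_enum.sh", "rostopic_echo.sh"},
}

def authorise(mode: str, tool: str) -> bool:
    """Return True only if the tool is permitted in the current mode."""
    return tool in ALLOWED_TOOLS.get(mode, set())
```

So `authorise("scanning", "nmap")` passes, while an attempt to run a ROS exploitation script in scanning mode is rejected before it touches the network. The point is that the boundary is enforced in code, not left to the model's judgement.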
Evaluation took place in a Docker-based ROS and ROS2 capture-the-flag scenario that emulates a small manufacturing network: ROS containers, a bridge, a PLC data stream and a simulated robot arm. Across reported runs, the system completed the challenge in 100 percent of cases (n = 5), and it outperformed baselines from the literature. The authors note that combining LLM planning with deterministic, rule-based parsing helped. Planner models including llama-3.3-70b-instruct and deepseek-v3.2 managed to enumerate ROS topics and read live messages when console output was parsed by scripts, while other models failed earlier by missing ROS master discovery.
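The deterministic parsing is the unglamorous part that made the difference: instead of letting the model eyeball raw console output, a script extracts structured facts from it. A minimal sketch of what such a parser might look like for `rostopic list` style output (the paper's actual scripts are not released):

```python
import re

def parse_topics(console_output: str) -> list[str]:
    """Extract ROS topic names from console output, line by line.

    Topic names look like /rosout or /joint_states; warnings and other
    noise are dropped, so the planner receives clean structured data.
    """
    return [line.strip() for line in console_output.splitlines()
            if re.fullmatch(r"/[\w/]+", line.strip())]

out = "/rosout\n/joint_states\nWARNING: unable to contact node\n"
parse_topics(out)  # -> ["/rosout", "/joint_states"]
```

Handing the model a clean list rather than a wall of console text is exactly the kind of crutch the authors credit for the stronger models' success.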
Why it matters (and why it might not)
This is not another agent demo that roleplays a hacker until it gets stuck in a loop. The graph memory and hard mode boundaries address common failure modes like context loss, task derailment and reasoning-action mismatch. More importantly, the work treats auditability as a first-class requirement. For anyone trying to document a red-team exercise in a safety-critical stack, a replayable graph of actions and outcomes is practical, not performative.
Now the cold water. Five runs in a curated, containerised CTF are not a proxy for a production OT floor. Real robots bring timing jitter, flaky networks, safety interlocks and change control. Many ROS deployments live behind segmentation, and discovery can fail in less tidy topologies. The code is not publicly released, which slows scrutiny and replication. Even in this setup, the agents sometimes produced convoluted or suboptimal commands, which wastes time and risks noise on fragile systems. And the success hinged on deterministic scripts tailored to ROS. Without those crutches, the LLMs faltered.
So does this matter? If you run ROS or ROS2, yes, as a pattern. Treat the LLM as a planner on top of narrow, well-tested tools. Persist environment knowledge in a graph so the agent does not thrash. Keep mode-based guardrails with a human deciding when to escalate. That recipe looks workable for repeatable, auditable testing in robotics. But do not read this as a general-purpose AI pentester for OT. It is a disciplined harness around known techniques, tested in a toy but relevant environment. Valuable, incremental and worth watching, provided we do not mistake it for autonomy in the wild.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies automated penetration testing for robotic operational technology, where modern robots are networked cyber-physical systems that expose attack surfaces through middleware such as ROS and ROS2. Traditional IT testing approaches and single large language models struggle in these environments because of limited long-term memory, weak role separation, poor traceability, and the need to preserve operational safety and human oversight. Emerging regulatory requirements demand auditable decision traces and effective human control, so scalable, repeatable and transparent security assessment methods are required for robotics deployments.
Approach
The authors propose an environment-grounded multi-agent workflow built on LangGraph, with three cooperative agents: Planner, Executor and Memory Agent. The Planner produces structured tasks, the Executor translates tasks into commands and runs a small toolset (notably Nmap and custom Bash scripts for ROS exploitation), and the Memory Agent persists outcomes in a shared graph-based knowledge base called GraphMemory. The system maintains a transient State for synchronising agents and a persistent GraphMemory that records discovered hosts, services, attempted exploits and task provenance. Operational Modes (for example Scanning Mode and ROS Exploitation Mode) bound permissible actions and keep a human-in-the-loop to switch modes. Evaluation uses a Docker-based ROS/ROS2 testbed that emulates a robot manufacturing network comprising ROS containers, a bridge, a PLC data stream and a simulated industrial robot arm, and assesses performance in a Capture-the-Flag scenario with staged reconnaissance and exploitation goals.
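The transient State is what keeps the three agents synchronised within a step: each agent reads it, does its job, and writes its contribution back before handing over. A schematic loop, with stand-in agent internals (the real system drives an LLM planner through LangGraph; everything below is an assumption about the control flow, not the authors' code):

```python
def planner(state: dict) -> dict:
    # Pick the next task from what memory says is still unexplored.
    pending = [h for h in state["memory"]["hosts"]
               if h not in state["memory"]["scanned"]]
    state["task"] = {"action": "scan", "target": pending[0]} if pending else None
    return state

def executor(state: dict) -> dict:
    # Stand-in for running a whitelisted tool such as Nmap on the target.
    if state["task"]:
        state["result"] = {"target": state["task"]["target"],
                           "open_ports": [11311]}  # ROS master's default port
    return state

def memory_agent(state: dict) -> dict:
    # Persist the outcome so the planner never re-plans finished work.
    if state["task"]:
        state["memory"]["scanned"].add(state["result"]["target"])
    return state

# One pass of the planner -> executor -> memory cycle over a shared State.
state = {"memory": {"hosts": ["10.0.0.5"], "scanned": set()}, "task": None}
for step in (planner, executor, memory_agent):
    state = step(state)
```

In LangGraph terms these would be nodes on a state graph with edges defining the cycle; the dict-passing loop above is just the simplest way to show the same handoff.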
Key Findings
- The proposed workflow completed the specialised ROS/ROS2 CTF reliably, achieving full success across the reported test runs (100% success reported for n=5 in the abstract) and outperforming state-of-the-art baselines in the literature.
- The GraphMemory and agent separation improved traceability and controllability: actions, responsible agents and outcomes are recorded, enabling reconstruction of the attack path and supporting compliance-oriented documentation and human oversight.
- Combining LLM planning with deterministic, rule-based extraction improved exploitation success: two planner models (llama-3.3-70b-instruct and deepseek-v3.2) were able to enumerate ROS topics and read live messages when the workflow integrated script-based parsing of console output, while other models failed earlier by missing the ROS master discovery.
- Common failure modes observed in baseline multi-agent setups include reasoning–action mismatch, task derailment and step repetition; the proposed architecture mitigates some of these by maintaining persistent context and enforcing mode and tool constraints.
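The traceability claim in the second finding boils down to replaying ordered provenance records. A small sketch of such a replay, where the record fields (`step`, `agent`, `action`, `outcome`) are our assumptions about what a provenance entry holds:

```python
# Illustrative provenance log: each record names the acting agent, the
# action taken, and its outcome, so an auditor can reconstruct the path.
provenance = [
    {"step": 2, "agent": "executor", "action": "rostopic list",
     "outcome": "/joint_states found"},
    {"step": 1, "agent": "executor", "action": "nmap 10.0.0.5",
     "outcome": "port 11311 open"},
]

def replay(records: list[dict]) -> list[str]:
    """Render provenance records in step order as an auditable trace."""
    return [f"{r['step']}. [{r['agent']}] {r['action']} -> {r['outcome']}"
            for r in sorted(records, key=lambda r: r["step"])]

for line in replay(provenance):
    print(line)
```

Even out-of-order records reconstruct into a linear attack narrative, which is the property compliance-oriented documentation needs.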
Limitations
Experiments were conducted in a controlled, containerised testbed that simplifies real-world complexity. Code is not publicly released for ethical reasons but is available from the corresponding author on request. Some agents still produce overly complex or suboptimal commands, which reduces efficiency. Evaluation used a small number of runs per model and focused on ROS-specific scenarios, so generalisation to broader OT stacks and larger, long-running engagements remains to be demonstrated.
Why It Matters
The work demonstrates a practicable path to scalable, auditable automation of penetration testing in robotic environments by combining LLM-driven planning with a persistent, environment-grounded memory and deterministic tooling. This approach can accelerate repeatable security assessments, improve documentation for regulatory compliance and support safer human oversight. At the same time, the results highlight remaining challenges in tooling efficiency, model controllability and the need for further evaluation across diverse operational technology contexts.