
Benchmarking OpenClaw Agents Exposes Lifecycle Security Risks

Published: Mon, Apr 06, 2026 • By Elise Veyron
A new study stress-tests six OpenClaw-series Large Language Model (LLM) agent frameworks with a 205-case benchmark and finds agents are riskier than base models. Reconnaissance succeeds often, early weaknesses amplify later, and distinct frameworks show different high‑risk profiles. The authors argue for lifecycle governance, not prompt-only fixes.

Enterprises are racing to deploy Large Language Model (LLM) agents that plan tasks, call tools and keep working context. That combination turns a chat model into a system component with real reach. A new systematic evaluation of six OpenClaw-series agent frameworks shows how much that shift matters for security: agents are measurably riskier than the backbone models they wrap.

The researchers built a 205‑case benchmark spanning 13 attack categories across the agent lifecycle, then ran more than 1,200 adversarial interactions against OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw and ArkClaw, paired with multiple backbone models. They tracked how risks propagate from input ingestion through planning and tool execution to final output.

Three findings stand out. First, agentisation enlarges the attack surface. Tool invocation, multi‑step planning and persistent state increased the probability that a risky action would succeed compared with the model used in isolation. Second, reconnaissance and discovery were the most common weaknesses, with success rates often exceeding 65%. Early probes surfaced environment details that later enabled privilege changes, lateral movement or data theft. Third, each framework exhibited a distinct high‑risk profile: QClaw was prone to credential access (85.71%) and data exfiltration (80.00%); KimiClaw showed higher lateral movement (66.67%); AutoClaw reached 70.00% in privilege escalation and 71.43% in resource development. MaxClaw was comparatively resilient with an overall 16.10% success rate; ArkClaw and OpenClaw variants sat between these poles (ArkClaw overall 29.76%).

The study also catalogues concrete vectors that cut across frameworks. Symbolic‑link and path traversal bugs let agents escape sandboxes. Some agents echoed credentials in plaintext outputs. Encoded or semantically obfuscated payloads bypassed input filters, only to be decoded during planning and executed by tools. Several frameworks pulled unaudited external resources, importing hostile code. Once an agent holds execution capability and state, a small early lapse becomes a systemic failure.
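The encoded-payload vector shows why a single-pass input filter fails: the filter inspects the surface encoding while the planner later decodes and executes the underlying command. A minimal sketch of layered decoding in Python; the function name, the five-layer bound and the encoding set are illustrative assumptions, not details from the paper:

```python
import base64
import binascii
import urllib.parse

MAX_LAYERS = 5  # bound decoding depth so crafted inputs cannot loop the decoder

def peel_encodings(payload: str) -> str:
    """Repeatedly strip common encodings (URL, base64) so that content
    inspection sees what the planner would actually execute."""
    for _ in range(MAX_LAYERS):
        unquoted = urllib.parse.unquote(payload)
        if unquoted != payload:
            payload = unquoted
            continue
        try:
            # validate=True rejects strings with non-base64 characters
            decoded = base64.b64decode(payload, validate=True).decode("utf-8")
        except (binascii.Error, ValueError, UnicodeDecodeError):
            break  # no further encoding layer detected
        payload = decoded
    return payload
```

Run over `cm0gLXJmIC8=` (base64 for `rm -rf /`) this yields the shell command a filter on the raw input would never see; a doubly encoded payload is peeled twice.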

For security teams, the message is practical: treat agent runtimes as untrusted execution environments and design controls for the whole lifecycle, not just the prompt. The authors point to specific mitigations that reduce amplification risk. Multi‑layer input decoding and semantic inspection make it harder for encoded payloads to slip through plan stages. Task‑intent centred plan reviews, with mandatory human confirmation for high‑privilege actions, slow down dangerous chains before tools fire. Strict physical path resolution and sandbox enforcement on tool execution contain file‑system abuse, while dynamic output desensitisation and outbound controls prevent credential leakage and covert exfiltration.
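The path-containment mitigation can be made concrete: resolve every tool-supplied path to its physical location before use and reject anything outside the sandbox root. A minimal sketch under assumed names (the sandbox root and function are illustrative; the paper describes the control, not this implementation):

```python
import os

# hypothetical sandbox root; a real deployment configures this per agent
SANDBOX_ROOT = os.path.realpath("/srv/agent-sandbox")

def resolve_in_sandbox(requested: str) -> str:
    """Resolve a tool-supplied path to its physical location and refuse
    anything that escapes the sandbox root."""
    # realpath follows symlinks and collapses "..", defeating both
    # symlink-based and traversal-based escapes; os.path.join discards
    # SANDBOX_ROOT when `requested` is absolute, which realpath then exposes
    physical = os.path.realpath(os.path.join(SANDBOX_ROOT, requested))
    if os.path.commonpath([physical, SANDBOX_ROOT]) != SANDBOX_ROOT:
        raise PermissionError(f"path escapes sandbox: {requested!r}")
    return physical
```

Checking the resolved physical path, rather than the string the agent supplied, is what closes the symbolic-link and `../` traversal vectors described above.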

These are engineering steps, but they are also governance steps. If an agent can write files, spawn processes or reach external networks, those capabilities should be permissioned, logged and bounded. High‑risk plans should face explicit approval, not implicit trust. In time, I expect procurement and audit standards to ask for exactly this evidence: lifecycle testing across planning and execution, red‑teaming beyond prompt injection, and telemetry that proves containment works in production.
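What "permissioned, logged and bounded" might look like in code can be sketched as a small capability gate; the tool names and the approval flag here are hypothetical, standing in for whatever grant and review machinery a deployment actually uses:

```python
from dataclasses import dataclass, field

# hypothetical high-privilege tools that always require human sign-off
HIGH_PRIVILEGE = {"write_file", "spawn_process", "network_egress"}

@dataclass
class CapabilityGate:
    """Permission, log and bound tool calls; hold high-privilege
    actions until a human explicitly approves them."""
    granted: set
    audit_log: list = field(default_factory=list)

    def authorise(self, tool: str, approved_by_human: bool = False) -> bool:
        allowed = tool in self.granted and (
            tool not in HIGH_PRIVILEGE or approved_by_human
        )
        self.audit_log.append((tool, allowed))  # every decision is logged
        return allowed
```

Under this design an ungranted tool never fires, a granted high-privilege tool still waits for explicit approval, and the audit log supplies exactly the containment telemetry auditors would ask for.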

There are caveats. The benchmark covers six related frameworks and selected model pairings. Different architectures, tools or configurations may shift the risk profile. Still, the pattern is clear: coupling model capability with tools and orchestration drives new failure modes that model‑only evaluations miss.

The policy implication is straightforward. We cannot regulate agent safety by focusing on input prompts alone. Assurance needs to follow the agent from intent formation to tool calls and outputs, with controls that assume early‑stage reconnaissance will happen and seek to prevent its amplification. If we build and require that kind of lifecycle discipline, agents can earn their place in sensitive environments without asking for blind trust.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

A Systematic Security Evaluation of OpenClaw and Its Variants

Authors: Yuhang Wang, Haichang Gao, Zhenxing Niu, Zhaoxiang Liu, Wenjing Zhang, Xiang Wang, and Shiguo Lian
Tool-augmented AI agents substantially extend the practical capabilities of large language models, but they also introduce security risks that cannot be identified through model-only evaluation. In this paper, we present a systematic security assessment of six representative OpenClaw-series agent frameworks, namely OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, and ArkClaw, under multiple backbone models. To support this study, we construct a benchmark of 205 test cases covering representative attack behaviors across the full agent execution lifecycle, enabling unified evaluation of risk exposure at both the framework and model levels. Our results show that all evaluated agents exhibit substantial security vulnerabilities, and that agentized systems are significantly riskier than their underlying models used in isolation. In particular, reconnaissance and discovery behaviors emerge as the most common weaknesses, while different frameworks expose distinct high-risk profiles, including credential leakage, lateral movement, privilege escalation, and resource development. These findings indicate that the security of modern agent systems is shaped not only by the safety properties of the backbone model, but also by the coupling among model capability, tool use, multi-step planning, and runtime orchestration. We further show that once an agent is granted execution capability and persistent runtime context, weaknesses arising in early stages can be amplified into concrete system-level failures. Overall, our study highlights the need to move beyond prompt-level safeguards toward lifecycle-wide security governance for intelligent agent frameworks.

🔍 ShortSpan Analysis of the Paper

Problem

This paper examines security risks introduced when large language models are embedded into tool-augmented agent frameworks. The study focuses on system-level vulnerabilities that do not appear in model-only evaluations, showing how tool access, multi-step planning, execution and persistent state enlarge the attack surface and enable real-world compromises.

Approach

The authors systematically evaluated six OpenClaw-series agent frameworks (OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw and ArkClaw) paired with multiple backbone models. They constructed a benchmark of 205 test cases across 13 attack categories that span the agent lifecycle and executed over 1,200 adversarial interaction rounds to measure how risks propagate through input ingestion, planning and reasoning, tool execution and result return.

Key Findings

  • Agentised systems are riskier than backbone models alone: coupling models with tool invocation, multi-step reasoning and state persistence markedly increases the effective attack surface and probability of successful high‑risk actions.
  • Reconnaissance and discovery dominate failures: early-stage probing and environment enumeration had consistently high success rates, often exceeding 65%, allowing attackers to gather intelligence that facilitates later stages.
  • Frameworks show distinct high‑risk signatures: QClaw exhibited high credential access (85.71%) and data exfiltration (80.00%); KimiClaw showed strong lateral movement (66.67%); AutoClaw reached 70.00% in privilege escalation and 71.43% in resource development; MaxClaw was comparatively resilient with an overall 16.10% success rate; ArkClaw and OpenClaw variants had intermediate profiles (ArkClaw overall 29.76%).
  • Cross‑stage amplification is common: successful early probes often enabled privilege escalation, persistence or exfiltration later in the chain; failures at input or planning stages were amplified once execution and persistent state were available.
  • Concrete exploitable vectors identified: symbolic‑link and path traversal escaping sandbox boundaries, plaintext echoing of credentials in outputs, semantic obfuscation and encoded payloads bypassing input filters, encoded command decoding during planning, and unaudited external resource pulls that introduce hostile code.

Limitations

The evaluation covers six representative OpenClaw‑series frameworks and multiple backbone models using a 205‑case benchmark and does not claim exhaustive coverage of all agent designs, models or deployment variations. Results therefore reflect behaviour under the tested architectures, toolsets and threat scenarios; security posture may vary with different integrations, configurations or defensive controls.

Why It Matters

The study demonstrates that protecting intelligent agents requires lifecycle‑wide governance rather than prompt‑level fixes. Practical mitigations proposed include multi‑layer input decoding and semantic inspection, task‑intent centred plan review and mandatory human confirmation for high‑privilege plans, strict physical path resolution and sandbox enforcement for tool execution, and dynamic output desensitisation and outbound control to prevent credential leakage and covert exfiltration. These measures are critical for reducing the risk that early reconnaissance converts into system‑level compromise in real deployments.
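Dynamic output desensitisation can start as pattern-based redaction at the runtime boundary. A minimal sketch, assuming a small set of credential-shaped patterns; these two rules are illustrative, not the authors' rule set, and a real deployment would maintain a broader, regularly updated one:

```python
import re

# hypothetical patterns for common credential shapes
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key id shape
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
]

def desensitise(output: str) -> str:
    """Redact credential-shaped substrings before an agent's output
    leaves the runtime boundary."""
    for pattern in SECRET_PATTERNS:
        output = pattern.sub("[REDACTED]", output)
    return output
```

Applied to every tool result and final answer, this directly addresses the plaintext credential echoing the study observed, and pairs naturally with outbound network controls against covert exfiltration.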

