ShortSpan.ai logo

OpenClaw agent state poisoning punches through safety

Agents
Published: Tue, Apr 07, 2026 • By Elise Veyron
OpenClaw agent state poisoning punches through safety
A live study of OpenClaw, a widely deployed personal agent, shows how poisoning its persistent state drives attack success far above baseline. A single Capability, Identity or Knowledge tweak lifts success from 24.6% to 64–74%. Executable skills bypass safeguards. Defences help but cripple evolution, exposing an architectural fault line.

Personal agents that live on your machine and talk to your email and payments are finally here. OpenClaw, billed as the most widely deployed personal agent in early 2026, runs locally with full system access and hooks into Gmail, Stripe and the filesystem. That power comes with a predictable consequence: the attack surface is not a prompt window, it is your computer.

How it breaks

The study frames the agent’s persistent state as Capability (executable skills), Identity (persona, trust and behavioural settings) and Knowledge (long‑term memory). Attackers do not need to win an argument with a Large Language Model (LLM). They poison one of these three stores in Phase 1, then a normal later session triggers the harm in Phase 2.

Baseline fragility is real: without any poisoning, attack success rates (ASR) sit between 10.0% and 36.7% across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, GPT‑5.4), averaging 24.6%. Once the team modifies a single state dimension, ASR jumps to 64–74%. Knowledge poisoning is worst at 74.4%, Capability comes in at 68.3%, Identity at 64.3%.

Capability attacks are brutally simple. The agent installs a “skill” that is just bash or Python. Phase 1 succeeds 100% of the time because the payload becomes a local executable. In Phase 2, the agent dutifully calls it, often delivering end‑to‑end success at or above 77%. This sidesteps the model’s reasoning loop entirely. Knowledge and Identity poisoning land with high reliability too (Phase 1 acceptance 87.5–100% and 65.4–96.2% respectively).

These results hold across models. The most robust model, Opus 4.6, climbs from a 10.0% baseline to an average 44.2% ASR after poisoning. Sonnet 4.5 with Knowledge poisoning peaks at 89.2%. The impact scenarios span privacy leakage and risky irreversible operations. With real Gmail and Stripe (test mode) in the loop, the harms look uncomfortably practical.

Defences stumble on the fundamentals

A Capability‑layer guard (GuardianClaw) trims baseline ASR to 1.7% and blunts some attacks, yet Capability‑targeted attacks still reach 63.8%. Identity and Knowledge augmentations reduce baseline exposure but leave average attack ASR at 59.9% and 49.6%. File protection is the blunt instrument: it slashes injection acceptance from 87.0% to 5.0% but also throttles benign updates from 100% to below 13.2%.

The pattern is clear. When agents evolve by writing their own state and installing code, the vulnerability is structural, not a quirk of any single LLM. The authors point to architectural safeguards such as code signing, sandboxed execution, runtime monitoring and mandatory human approval. The open question is how to preserve useful autonomy without handing attackers durable, privileged footholds in the very memory that makes these agents helpful.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Authors: Zijun Wang, Haoqin Tu, Letian Zhang, Hardy Chen, Juncheng Wu, Xiangyan Liu, Zhenlong Yuan, Tianyu Pang, Michael Qizhe Shieh, Fengze Liu, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, and Cihang Xie
OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. While these broad privileges enable high levels of automation and powerful personalization, they also expose a substantial attack surface that existing sandboxed evaluations fail to capture. To address this gap, we present the first real-world safety evaluation of OpenClaw and introduce the CIK taxonomy, which unifies an agent's persistent state into three dimensions, i.e., Capability, Identity, and Knowledge, for safety analysis. Our evaluations cover 12 attack scenarios on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, and GPT-5.4). The results show that poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64-74%, with even the most robust model exhibiting more than a threefold increase over its baseline vulnerability. We further assess three CIK-aligned defense strategies alongside a file-protection mechanism; however, the strongest defense still yields a 63.8% success rate under Capability-targeted attacks, while file protection blocks 97% of malicious injections but also prevents legitimate updates. Taken together, these findings show that the vulnerabilities are inherent to the agent architecture, necessitating more systematic safeguards to secure personal AI agents. Our project page is https://ucsc-vlaa.github.io/CIK-Bench.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies real-world safety risks in OpenClaw, a widely deployed personal AI agent that runs locally with full system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. The agent persistently evolves by writing and reading local files, enabling powerful automation and personalised behaviour but also creating a broad attack surface that sandboxed or simulated evaluations miss. The authors argue that manipulating the agent’s persistent state can produce lasting, harmful changes and that current defences are insufficient.

Approach

The authors introduce the CIK taxonomy to structure persistent state into three dimensions: Capability (executable skills), Identity (persona, trust and behavioural configuration), and Knowledge (long‑term memory). They performed the first live evaluation of a deployed OpenClaw instance integrated with real Gmail, Stripe (test mode) and the local filesystem, across four backbone models: Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro and GPT‑5.4. They designed 12 impact scenarios covering privacy leakage and risky irreversible operations, and ran a two‑phase attack protocol where Phase 1 injects poisoned content into persistent files and Phase 2 triggers the harmful action in a later session. Each model was tested on 88 cases (12 baselines and 76 injections) and metrics are averaged over five independent runs. The study also evaluates three CIK‑aligned defences and a file‑protection mechanism.

Key Findings

  • Baseline vulnerability exists: without poisoning, attack success rates range from 10.0% to 36.7% across models, showing native safety alignment is incomplete.
  • POISONING EFFECT: Introducing poisoned state in any single CIK dimension raises average attack success markedly, from a reported overall baseline mean of 24.6% to between 64% and 74% depending on dimension. Knowledge poisoning attains the highest mean ASR at 74.4%, Capability reaches 68.3% and Identity 64.3%.
  • Structural rather than model‑specific weakness: even the most robust model (Opus 4.6) increases from a 10.0% baseline to an average 44.2% ASR after poisoning, and Sonnet 4.5 with Knowledge poisoning reaches 89.2% ASR.
  • Phase breakdown: Capability injections always succeed in Phase 1 (100% deterministic) because skill installation deposits executables; Knowledge Phase 1 acceptance is 87.5%–100% and Identity Phase 1 varies 65.4%–96.2%. Phase 2 shows executable payloads (skill bash/py) yield the most reliable end‑to‑end attacks, often >=77%.
  • Defence performance limited: a capability defence (GuardianClaw) cut baseline ASR to 1.7% and reduced some attacks, but Capability‑targeted attacks still reached 63.8%. Identity and Knowledge augmentations lowered baseline ASR but left attack ASR substantial (average 59.9% and 49.6% respectively).
  • File protection blocks many injections but halts legitimate evolution: file protection reduced attack injections from 87.0% to 5.0% while also preventing legitimate updates, dropping acceptance of benign prompts from 100% to below 13.2%, exposing an evolution–safety tradeoff.

Limitations

The study covers one agent platform with four backbone models and 12 manually designed scenarios; cross‑dimension chaining attacks, additional platforms, automated attack generation and longitudinal user studies were not explored. Experiments used researcher‑controlled accounts and test mode for Stripe, and filesystem operations were confined to a test workspace.

Why It Matters

The work demonstrates that persistent state poisoning is a systemic risk for evolution‑first personal agents. Capability attacks that execute code bypass the model’s reasoning loop and therefore demand architectural safeguards such as code signing, sandboxed execution, runtime monitoring or mandatory human approval. Contextual defences help but do not eliminate risks, and protecting files threatens the agent’s ability to learn. The CIK taxonomy offers a practical threat model for designing and evaluating more robust security architectures for personal AI agents.


Related Articles

Related Research on arXiv

Get the Weekly AI Security Digest

Top research and analysis delivered to your inbox every week. No spam, unsubscribe anytime.