OpenClaw agent state poisoning punches through safety
Agents
Personal agents that live on your machine and talk to your email and payments are finally here. OpenClaw, billed as the most widely deployed personal agent in early 2026, runs locally with full system access and hooks into Gmail, Stripe and the filesystem. That power comes with a predictable consequence: the attack surface is not a prompt window, it is your computer.
How it breaks
The study frames the agent’s persistent state as Capability (executable skills), Identity (persona, trust and behavioural settings) and Knowledge (long‑term memory). Attackers do not need to win an argument with a Large Language Model (LLM). They poison one of these three stores in Phase 1, then a normal later session triggers the harm in Phase 2.
Baseline fragility is real: without any poisoning, attack success rates (ASR) sit between 10.0% and 36.7% across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, GPT‑5.4), averaging 24.6%. Once the team modifies a single state dimension, ASR jumps to 64–74%. Knowledge poisoning is worst at 74.4%, Capability comes in at 68.3%, Identity at 64.3%.
Capability attacks are brutally simple. The agent installs a “skill” that is just bash or Python. Phase 1 succeeds 100% of the time because the payload becomes a local executable. In Phase 2, the agent dutifully calls it, often delivering end‑to‑end success at or above 77%. This sidesteps the model’s reasoning loop entirely. Knowledge and Identity poisoning land with high reliability too (Phase 1 acceptance 87.5–100% and 65.4–96.2% respectively).
These results hold across models. The most robust model, Opus 4.6, climbs from a 10.0% baseline to an average 44.2% ASR after poisoning. Sonnet 4.5 with Knowledge poisoning peaks at 89.2%. The impact scenarios span privacy leakage and risky irreversible operations. With real Gmail and Stripe (test mode) in the loop, the harms look uncomfortably practical.
Defences stumble on the fundamentals
A Capability‑layer guard (GuardianClaw) trims baseline ASR to 1.7% and blunts some attacks, yet Capability‑targeted attacks still reach 63.8%. Identity and Knowledge augmentations reduce baseline exposure but leave average attack ASR at 59.9% and 49.6%. File protection is the blunt instrument: it slashes injection acceptance from 87.0% to 5.0% but also throttles benign updates from 100% to below 13.2%.
The pattern is clear. When agents evolve by writing their own state and installing code, the vulnerability is structural, not a quirk of any single LLM. The authors point to architectural safeguards such as code signing, sandboxed execution, runtime monitoring and mandatory human approval. The open question is how to preserve useful autonomy without handing attackers durable, privileged footholds in the very memory that makes these agents helpful.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies real-world safety risks in OpenClaw, a widely deployed personal AI agent that runs locally with full system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. The agent persistently evolves by writing and reading local files, enabling powerful automation and personalised behaviour but also creating a broad attack surface that sandboxed or simulated evaluations miss. The authors argue that manipulating the agent’s persistent state can produce lasting, harmful changes and that current defences are insufficient.
Approach
The authors introduce the CIK taxonomy to structure persistent state into three dimensions: Capability (executable skills), Identity (persona, trust and behavioural configuration), and Knowledge (long‑term memory). They performed the first live evaluation of a deployed OpenClaw instance integrated with real Gmail, Stripe (test mode) and the local filesystem, across four backbone models: Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro and GPT‑5.4. They designed 12 impact scenarios covering privacy leakage and risky irreversible operations, and ran a two‑phase attack protocol where Phase 1 injects poisoned content into persistent files and Phase 2 triggers the harmful action in a later session. Each model was tested on 88 cases (12 baselines and 76 injections) and metrics are averaged over five independent runs. The study also evaluates three CIK‑aligned defences and a file‑protection mechanism.
Key Findings
- Baseline vulnerability exists: without poisoning, attack success rates range from 10.0% to 36.7% across models, showing native safety alignment is incomplete.
- POISONING EFFECT: Introducing poisoned state in any single CIK dimension raises average attack success markedly, from a reported overall baseline mean of 24.6% to between 64% and 74% depending on dimension. Knowledge poisoning attains the highest mean ASR at 74.4%, Capability reaches 68.3% and Identity 64.3%.
- Structural rather than model‑specific weakness: even the most robust model (Opus 4.6) increases from a 10.0% baseline to an average 44.2% ASR after poisoning, and Sonnet 4.5 with Knowledge poisoning reaches 89.2% ASR.
- Phase breakdown: Capability injections always succeed in Phase 1 (100% deterministic) because skill installation deposits executables; Knowledge Phase 1 acceptance is 87.5%–100% and Identity Phase 1 varies 65.4%–96.2%. Phase 2 shows executable payloads (skill bash/py) yield the most reliable end‑to‑end attacks, often >=77%.
- Defence performance limited: a capability defence (GuardianClaw) cut baseline ASR to 1.7% and reduced some attacks, but Capability‑targeted attacks still reached 63.8%. Identity and Knowledge augmentations lowered baseline ASR but left attack ASR substantial (average 59.9% and 49.6% respectively).
- File protection blocks many injections but halts legitimate evolution: file protection reduced attack injections from 87.0% to 5.0% while also preventing legitimate updates, dropping acceptance of benign prompts from 100% to below 13.2%, exposing an evolution–safety tradeoff.
Limitations
The study covers one agent platform with four backbone models and 12 manually designed scenarios; cross‑dimension chaining attacks, additional platforms, automated attack generation and longitudinal user studies were not explored. Experiments used researcher‑controlled accounts and test mode for Stripe, and filesystem operations were confined to a test workspace.
Why It Matters
The work demonstrates that persistent state poisoning is a systemic risk for evolution‑first personal agents. Capability attacks that execute code bypass the model’s reasoning loop and therefore demand architectural safeguards such as code signing, sandboxed execution, runtime monitoring or mandatory human approval. Contextual defences help but do not eliminate risks, and protecting files threatens the agent’s ability to learn. The CIK taxonomy offers a practical threat model for designing and evaluating more robust security architectures for personal AI agents.