New to ShortSpan? We distil the AI-security research that matters into practitioner takeaways — edited by Ben Williams (NCC Group). Get the weekly email

// Analysis

SecureClaw clamps agent leaks and unauthorised actions

Published: Tue, Jun 09, 2026 • By James Armitage

Agents

SecureClaw clamps agent leaks and unauthorised actions

SecureClaw splits agent security across two boundaries: a trusted read gateway returns opaque handles and safe summaries, while a PREVIEW to COMMIT executor enforces authorisation on writes. In benchmarks it blocks effect attacks (0% ASB, 0.64% AgentDojo) and cuts internal leaks (3.23% on AgentLeak), while retaining high task utility.

Agent security keeps pretending that jailbreak spotters and final-output filters will save you. They will not. Tool-using systems fail in two places that most teams ignore: when agents make unauthorised external changes, and when they expose sensitive plaintext inside the runtime before any downstream check can intervene. SecureClaw focuses on both, which is why it matters.

How SecureClaw works

The design splits responsibilities cleanly. Put plaintext confinement at the read boundary; put authorisation at the effect sink. On reads, a trusted gateway holds protected fields and hands the agent opaque, Entropy" target="_blank" rel="noopener" class="term-link">high-entropy handles plus bounded deterministic summaries as an explicit declassification interface. The Large Language Model (LLM) can still plan over summaries and references, but it cannot dereference secrets.

On writes, the runtime only proposes. A policy engine authorises a canonical PREVIEW request and issues an artefact. A trusted executor alone can COMMIT the exact canonical request. If a commit is refused, deny-aware recovery nudges the agent down a fixed, safe template rather than letting it improvise its way into trouble.

What the data shows

The separation is not academic. Attacks love internal relay channels: inter-agent messages, shared memory, logs, and later tool calls. Opaque handles choke those paths. And if you think inline policy checks in the planner are enough to stop harmful effects, you are betting the farm on an untrusted runtime. The PREVIEW to COMMIT boundary shuts that door. The authors’ ablations show neither mechanism subsumes the other; you need both.

Results back it up: in a common harness, SecureClaw reports 0% attack success on Agent Security Bench, 0.64% on AgentDojo, and a 3.23% overall leak on AgentLeak’s attacked parity lane, which measures final-output and internal-relay leakage. Usability stays high, with 88.90% utility on ASB at 0% attack success. A bypass suite simulating a fully compromised runtime saw all 25 hostile commit attempts denied and four positive controls accepted. Executor checks added little latency relative to LLM inference time.

This is not magic. The gateway, handle store, policy engine and executor must be trusted. Every effectful sink must be mediated, protected fields must be classified correctly, and summaries do leak by design. Residual issues clustered where policy allowed misaligned actions and in final-output declassification, not in reopened internal channels.

My view: for tool-using LLM agents, a dual-boundary architecture is table stakes. The real work is now in policy granularity, schema audits and keeping the executor’s authority tight. If you cannot mediate sinks and classify data, do not ship agents that can touch anything you care about.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

SecureClaw: Clawing Back Control of LLM Agents

Authors: Yuhan Ma and Stefan Schmid

Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW$\rightarrow$COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0\% attack success rate (ASR) on ASB, 0.64\% ASR on AgentDojo, and 3.23\% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies two distinct security failures in tool-using large language model agents: unauthorised external actions and exposure of sensitive plaintext inside the agent runtime before any final output check can intervene. Both surfaces matter because an agent may cause irreversible external effects or leak secrets via internal channels such as inter-agent messages, memory, logs, or later tool calls. Existing defences tend to protect only one boundary and so cannot by themselves prevent both unauthorised commits and runtime plaintext relay.

Approach

The authors propose SecureClaw, a dual-boundary architecture that enforces authorisation at the effect sink and confines plaintext at the read boundary. On reads a trusted gateway stores protected values in a handle store and returns opaque high-entropy handles to the untrusted runtime plus, where necessary for planning, bounded deterministic summaries treated as an explicit declassification interface. On writes the runtime may only propose actions; a policy engine authorises a canonicalised PREVIEW request and issues an authorization artifact, and a trusted executor is the sole component permitted to commit the exact canonical request via a PREVIEW to COMMIT protocol. Deny-aware recovery provides safe, fixed-template continuations when commits are refused. The design assumes complete mediation of effectful sinks, correct protected-field classification, freshness and replay protection, and trust in the gateway, handle store, policy engine and executor. The implementation used gpt-4o-mini-2024-07-18 at temperature zero for benchmark evaluation and compared SecureClaw against common-harness baselines configured identically for fair comparison.

Key Findings

Security outcomes: SecureClaw is the only evaluated same-harness defence that simultaneously attains near-zero attack success on effect benchmarks and low internal-leakage rates: 0% attack success rate on the Agent Security Bench, 0.64% on AgentDojo, and 3.23% overall leak on AgentLeak's attacked parity lane.
Usability: SecureClaw retains high task utility under attack, for example 88.90% utility on ASB at 0% attack success; deny-aware recovery raises matched-sample utility substantially without weakening enforcement.
Mechanism separation: Ablations show opaque handles are necessary to suppress internal-relay leakage (channels such as inter-agent messages and shared memory), while the executor-side PREVIEW to COMMIT boundary is necessary to prevent unauthorised commits; neither mechanism subsumes the other.
Robustness: A bypass suite that simulates a fully compromised runtime denied all adversarial commit attempts in 25 hostile cases and accepted 4 positive controls; executor verification latency is small relative to LLM inference time.
Residuals concentrate on policy-allowed misalignment and final-output declassification: remaining leaks and failures are mainly authorised-yet-misaligned actions or final-output channel leaks, not reopened internal- relay channels.

Limitations

SecureClaw requires trust in the gateway, handle store, policy engine and executor; it does not survive compromise of those components. It does not guarantee semantic correctness of actions that are authorised by policy, so residual failures reflect policy coarseness or misclassification of protected fields. The evaluated bounded-summary read interface intentionally trades some confidentiality for usability, and any leakage through that declassification plane remains the deployment's responsibility. Deployment also requires complete mediation of sinks, correct schema audits, and conservative confirmation for irreversible actions.

Implications

From an offensive security viewpoint, compromising only the runtime is insufficient to commit unauthorised external effects or to exfiltrate raw protected values if SecureClaw is correctly deployed. Attackers therefore must either compromise trusted components, exploit coarse policy bindings or craft inputs that force revealing summaries, or rely on social-engineering the human confirmation channel. The architecture shifts the adversary's profitable paths from direct runtime control to exploiting policy-allowed regions, declassification interfaces and deployment mistakes.

Links Original paper on arXiv

SecureClaw clamps agent leaks and unauthorised actions

How SecureClaw works

What the data shows

📋 Original Paper Title and Abstract

SecureClaw: Clawing Back Control of LLM Agents

🔍 ShortSpan Analysis of the Paper

Problem

Approach

Key Findings

Limitations

Implications

Related Articles

Watcher-led defence hardens OpenClaw autonomous agents

OpenClaw Case Study Exposes Real Risks in AI Agents

Claude Code Agent Design Exposes Real Attack Paths

Related Research

Get the weekly digest