SecureClaw clamps agent leaks and unauthorised actions
Agent security keeps pretending that jailbreak spotters and final-output filters will save you. They will not. Tool-using systems fail in two places that most teams ignore: when agents make unauthorised external changes, and when they expose sensitive plaintext inside the runtime before any downstream check can intervene. SecureClaw focuses on both, which is why it matters.
How SecureClaw works
The design splits responsibilities cleanly. Put plaintext confinement at the read boundary; put authorisation at the effect sink. On reads, a trusted gateway holds protected fields and hands the agent opaque, high-entropy handles plus bounded deterministic summaries, which act as an explicit declassification interface. The Large Language Model (LLM) can still plan over summaries and references, but it cannot dereference secrets.
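To make the read boundary concrete, here is a minimal sketch of a gateway that swaps protected fields for handles. It is an illustration under assumptions, not the paper's implementation: the `Gateway` class, the `summarise` function, the `h_` prefix and the `_summary` key suffix are all hypothetical names invented for this example.

```python
import secrets
from dataclasses import dataclass, field


def summarise(value: str) -> str:
    """Hypothetical bounded, deterministic summary: coarse kind and size only."""
    kind = "numeric" if value.isdigit() else "text"
    size = "short" if len(value) <= 16 else "long"
    return f"{kind}/{size}"


@dataclass
class Gateway:
    """Trusted component: the only place plaintext of protected fields lives."""
    _store: dict = field(default_factory=dict)

    def read(self, record: dict, protected: set) -> dict:
        """Return the view the untrusted runtime is allowed to see."""
        view = {}
        for key, value in record.items():
            if key in protected:
                # 16 random bytes gives a 128-bit handle that cannot be guessed.
                handle = "h_" + secrets.token_urlsafe(16)
                self._store[handle] = value
                view[key] = handle
                view[key + "_summary"] = summarise(value)
            else:
                view[key] = value
        return view

    def resolve(self, handle: str) -> str:
        """Dereference a handle. Assumed callable by the trusted executor only."""
        return self._store[handle]


gateway = Gateway()
view = gateway.read({"name": "Ada", "iban": "1234567890"}, protected={"iban"})
```

The agent can pass `view["iban"]` around freely, including through memory, logs and inter-agent messages, because the handle carries no information about the value. Whatever the summary reveals is the deliberate leak, so its output space should stay small and auditable.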
On writes, the runtime only proposes. A policy engine authorises a canonical PREVIEW request and issues an artefact. A trusted executor alone can COMMIT the exact canonical request. If a commit is refused, deny-aware recovery nudges the agent down a fixed, safe template rather than letting it improvise its way into trouble.
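A rough sketch of the PREVIEW to COMMIT idea follows. Everything specific in it is an assumption made for illustration: the paper does not say the artefact is an HMAC, `json.dumps` with sorted keys is only a stand-in for a real canonicalisation scheme, and `KEY`, `canonical`, `preview` and `commit` are hypothetical names. The policy here is a bare tool allowlist, far coarser than anything you would deploy.

```python
import hashlib
import hmac
import json
from typing import Optional

# Assumption: a secret held by the trusted components and never by the runtime.
# Sharing one key is a simplification; it lets the executor mint artefacts too.
KEY = b"demo-key-held-by-trusted-components"


def canonical(request: dict) -> bytes:
    """Stand-in canonical form: sorted keys, no insignificant whitespace."""
    return json.dumps(request, sort_keys=True, separators=(",", ":")).encode()


def preview(request: dict, allowed_tools: set) -> Optional[str]:
    """Policy engine: authorise the canonical request or refuse with None."""
    if request.get("tool") not in allowed_tools:
        return None
    return hmac.new(KEY, canonical(request), hashlib.sha256).hexdigest()


def commit(request: dict, artefact: str) -> bool:
    """Executor: act only if the artefact binds to this exact request."""
    expected = hmac.new(KEY, canonical(request), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, artefact)


request = {"tool": "send_email", "to": "alice@example.com"}
artefact = preview(request, allowed_tools={"send_email"})

# A compromised runtime swaps the recipient after authorisation.
tampered = dict(request, to="attacker@example.com")
```

With these definitions `commit(request, artefact)` succeeds and `commit(tampered, artefact)` fails, because the artefact covers every field of the canonical request. That binding is the point: the runtime can propose anything, but it cannot stretch one approval to cover a different effect.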
What the data shows
The separation is not academic. Attacks love internal relay channels: inter-agent messages, shared memory, logs, and later tool calls. Opaque handles choke those paths. And if you think inline policy checks in the planner are enough to stop harmful effects, you are betting the farm on an untrusted runtime. The PREVIEW to COMMIT boundary shuts that door. The authors’ ablations show neither mechanism subsumes the other; you need both.
Results back it up: in a common harness, SecureClaw reports 0% attack success on Agent Security Bench (ASB), 0.64% on AgentDojo, and a 3.23% overall leak rate on AgentLeak’s attacked parity lane, which measures final-output and internal-relay leakage. Usability stays high, with 88.90% utility on ASB at 0% attack success. In a bypass suite simulating a fully compromised runtime, all 25 hostile commit attempts were denied and four positive controls were accepted. Executor checks added little latency relative to LLM inference time.
This is not magic. The gateway, handle store, policy engine and executor must be trusted. Every effectful sink must be mediated, protected fields must be classified correctly, and summaries do leak by design. Residual issues clustered where policy allowed misaligned actions and in final-output declassification, not in reopened internal channels.
My view: for tool-using LLM agents, a dual-boundary architecture is table stakes. The real work is now in policy granularity, schema audits and keeping the executor’s authority tight. If you cannot mediate sinks and classify data, do not ship agents that can touch anything you care about.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
SecureClaw: Clawing Back Control of LLM Agents
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies two distinct security failures in tool-using large language model agents: unauthorised external actions and exposure of sensitive plaintext inside the agent runtime before any final output check can intervene. Both surfaces matter because an agent may cause irreversible external effects or leak secrets via internal channels such as inter-agent messages, memory, logs, or later tool calls. Existing defences tend to protect only one boundary and so cannot by themselves prevent both unauthorised commits and runtime plaintext relay.
Approach
The authors propose SecureClaw, a dual-boundary architecture that enforces authorisation at the effect sink and confines plaintext at the read boundary. On reads, a trusted gateway stores protected values in a handle store and returns opaque high-entropy handles to the untrusted runtime plus, where necessary for planning, bounded deterministic summaries treated as an explicit declassification interface. On writes, the runtime may only propose actions; a policy engine authorises a canonicalised PREVIEW request and issues an authorisation artefact, and a trusted executor is the sole component permitted to commit the exact canonical request via a PREVIEW to COMMIT protocol. Deny-aware recovery provides safe, fixed-template continuations when commits are refused. The design assumes complete mediation of effectful sinks, correct protected-field classification, freshness and replay protection, and trust in the gateway, handle store, policy engine and executor. The benchmark evaluation used gpt-4o-mini-2024-07-18 at temperature zero and compared SecureClaw against common-harness baselines configured identically for fair comparison.
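The summary lists freshness and replay protection among the assumptions without saying how they are met. One conventional way to get both, sketched here as an assumption and not as the authors' mechanism, is to stamp each authorisation with a nonce and an issue time and have the executor reject anything stale or already used. `ReplayGuard` and its parameters are hypothetical names.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReplayGuard:
    """Executor-side check: each authorisation is usable once, within a window."""
    ttl_seconds: float = 30.0
    _seen: set = field(default_factory=set)

    def accept(self, nonce: str, issued_at: float,
               now: Optional[float] = None) -> bool:
        current = time.time() if now is None else now
        # Freshness: reject expired artefacts and ones stamped in the future.
        if issued_at > current or current - issued_at > self.ttl_seconds:
            return False
        # Replay: a nonce that has already been committed is refused.
        if nonce in self._seen:
            return False
        self._seen.add(nonce)
        return True


guard = ReplayGuard(ttl_seconds=30.0)
first = guard.accept("n1", issued_at=100.0, now=110.0)   # fresh, unseen
replay = guard.accept("n1", issued_at=100.0, now=111.0)  # same nonce again
stale = guard.accept("n2", issued_at=100.0, now=200.0)   # outside the window
```

In a real deployment the nonce and timestamp would need to be covered by the authorisation artefact itself, otherwise the runtime could simply rewrite them. The seen-set here also grows without bound; pruning entries older than the window is left out for brevity.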
Key Findings
- Security outcomes: SecureClaw is the only evaluated same-harness defence that simultaneously attains near-zero attack success on effect benchmarks and low internal-leakage rates: 0% attack success rate on the Agent Security Bench, 0.64% on AgentDojo, and 3.23% overall leak on AgentLeak's attacked parity lane.
- Usability: SecureClaw retains high task utility under attack, for example 88.90% utility on ASB at 0% attack success; deny-aware recovery raises matched-sample utility substantially without weakening enforcement.
- Mechanism separation: Ablations show opaque handles are necessary to suppress internal-relay leakage (channels such as inter-agent messages and shared memory), while the executor-side PREVIEW to COMMIT boundary is necessary to prevent unauthorised commits; neither mechanism subsumes the other.
- Robustness: In a bypass suite that simulates a fully compromised runtime, adversarial commit attempts were denied in all 25 hostile cases and the 4 positive controls were accepted; executor verification latency is small relative to LLM inference time.
- Residuals concentrate on policy-allowed misalignment and final-output declassification: remaining leaks and failures are mainly authorised-yet-misaligned actions or final-output channel leaks, not reopened internal-relay channels.
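The usability finding leans on deny-aware recovery, so a sketch of what a fixed-template continuation might look like is useful. The deny reasons, template wording and function name below are invented for illustration; the source only says that refused commits lead to safe, fixed templates.

```python
from enum import Enum


class DenyReason(Enum):
    """Hypothetical reasons an executor might give for refusing a commit."""
    POLICY = "policy"
    STALE = "stale"
    MISMATCH = "mismatch"


# Fixed continuations chosen by the trusted side. The agent does not get to
# improvise its next step after a denial.
RECOVERY_TEMPLATES = {
    DenyReason.POLICY: "Action '{tool}' is not permitted. Tell the user and stop.",
    DenyReason.STALE: "Authorisation for '{tool}' expired. Request a new PREVIEW.",
    DenyReason.MISMATCH: "Request for '{tool}' changed after PREVIEW. Resubmit it unchanged.",
}


def recover(reason: DenyReason, tool: str) -> str:
    """Return the fixed continuation for a denial; only the tool name varies."""
    return RECOVERY_TEMPLATES[reason].format(tool=tool)


message = recover(DenyReason.POLICY, "send_email")
```

Keeping the set of continuations closed is what stops recovery from becoming a second planning loop that an attacker can steer. The tool name should come from the canonical request the executor saw, not from free text supplied by the runtime.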
Limitations
SecureClaw requires trust in the gateway, handle store, policy engine and executor; it does not survive compromise of those components. It does not guarantee semantic correctness of actions that are authorised by policy, so residual failures reflect policy coarseness or misclassification of protected fields. The evaluated bounded-summary read interface intentionally trades some confidentiality for usability, and any leakage through that declassification plane remains the deployment's responsibility. Deployment also requires complete mediation of sinks, correct schema audits, and conservative confirmation for irreversible actions.
Implications
From an offensive security viewpoint, compromising only the runtime is insufficient to commit unauthorised external effects or to exfiltrate raw protected values if SecureClaw is correctly deployed. Attackers must therefore compromise trusted components, exploit coarse policy bindings, craft inputs that force revealing summaries, or social-engineer the human confirmation channel. The architecture shifts the adversary's profitable paths from direct runtime control to exploiting policy-allowed regions, declassification interfaces and deployment mistakes.