Claude Code Agent Design Exposes Real Attack Paths
Agents
Agent systems love to brag about their clever reasoning loops. Claude Code’s loop is literally a while-true: call model, run tool, repeat. The interesting security work lives outside that loop, in the harness that decides whether tools run, what context the Large Language Model (LLM) sees, and which extensions get a say. That is where attackers aim.
The paper documents a deny-first permission stack with seven modes, an ML classifier that routes each action to auto-approval or a user prompt, pre- and post-execution hooks, and an optional shell sandbox. Any single layer can block an action. Flip that around and you get the attacker’s playbook: find the layer that blinks under pressure. If the classifier misroutes to auto, if hooks run before checks, or if the sandbox is off or porous, you are a shell command away from real impact. The most concrete signal: researchers previously found a pre-trust initialisation window that let privileged code run before the full permission pipeline engaged. Multiple CVEs, now patched, but the class of bug is timeless: unsafe ordering during startup.
Extensibility is the real blast radius
Claude Code aggregates tools via MCP servers, plugins, skills and hooks, with builds showing on the order of a few dozen tools. That is a wide funnel. A compromised plugin or poisoned MCP server does not need prompt-injection genius; it needs a cooperative permission outcome and a moment when the harness trusts the extension’s output. The paper calls out hook pipelines and deferred schema loading as complexity multipliers. Complexity is where timing bugs and policy gaps breed.
Context is the binding resource. A five-layer compaction pipeline plus lazy loading keeps tokens in budget, from older ~200k contexts to about 1M in the newer Claude 4.6 series. Compaction and lazy fetches mean the model’s actual input is a moving target. Attackers like moving targets: they reduce operator predictability and make post-hoc audits harder to reason about.
Subagent delegation is tidy on paper: spawn a subagent with its own worktree or remote isolation, keep a sidechain transcript, return a short summary to the parent to save context. From an attacker’s perspective, that summary boundary is cover. If the parent only ever sees a digest, the messy bits live in a parallel log that fewer humans read, especially under incident pressure.
Gateway answers differ, risks do not
Set Claude Code against OpenClaw’s gateway model and the trust shape changes. Claude Code leans on per-action checks inside a session harness. OpenClaw pushes identity and access control to the perimeter and a control plane. In the former, I target the classifier and tool orchestration. In the latter, I go straight for the gateway’s control surface. Same questions, different blast radii.
The unresolved edges are where this gets interesting: how reliable are ML gatekeepers when tokens spike, how often do hooks fire before policy, and what exactly happens in the first seconds of initialisation when every extension wants to be helpful? Those are not academic queries; they are 3am questions.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
🔍 ShortSpan Analysis of the Paper
Problem
The paper analyses the architecture of Claude Code, an agentic coding tool that can run shell commands, edit files and call external services, and contrasts it with OpenClaw, an open-source multi-channel agent gateway. It documents how design choices address recurring questions in agent systems - where reasoning runs, how safety is enforced, how context is managed, how extensibility is structured, how work is delegated, and how sessions persist - and why these choices matter for security, reliability and human control.
Approach
Source-level analysis of the publicly available TypeScript package for Claude Code (v2.1.88) was used to map components and code paths. The study organises findings around five motivating human values and thirteen design principles, traces a representative task through the system, and contrasts Claude Code with OpenClaw to show how deployment context changes architectural answers. Evidence tiers distinguish product documentation, code-verified claims and reconstructed inferences.
Key Findings
- Architecture and harness: Claude Code centres on a simple reactive while-loop that calls the model, runs tools and repeats, but most implementation effort is in surrounding infrastructure - permissions, context compaction, extensibility and persistence.
- Layered safety: A deny-first permission system with up to seven modes, an ML auto-mode classifier, Pre/Post hook pipeline and optional shell sandbox provide defence in depth; any single layer can block actions.
- Context management: Context is the binding resource. A five-layer compaction pipeline (budget reduction, snip, microcompact, context collapse, auto-compact) plus lazy loading of CLAUDE.md files limits token use; older models had ~200k context, newer Claude 4.6 series support ~1M.
- Extensibility surface: Four mechanisms - MCP servers, plugins, skills and hooks - expose different context costs and trade-offs; assembleToolPool merges built-in and external tools, up to about 54 tools in some builds.
- Subagent delegation and isolation: AgentTool spawns isolated subagents with worktree, remote or in-process isolation; subagents write sidechain transcripts and return summaries only to conserve parent context.
- Persistence and auditability: Session transcripts are mostly append-only JSONL files, enabling reconstruction and audit while deliberately not restoring session-scoped permissions on resume to avoid carrying stale trust.
- Deployment-sensitive trade-offs: Compared with OpenClaw, Claude Code opts for per-action safety evaluation and rich per-session harnessing, whereas OpenClaw prioritises perimeter identity and gateway-level access control, illustrating how trust model and scope change architectural choices.
- Known temporal vulnerability: Independent researchers found pre-trust initialisation ordering issues that created a privileged window before the full permission pipeline engaged; multiple CVEs were disclosed and patched.
Limitations
Analysis is a static snapshot of a specific code release and feature-flagged builds; runtime behaviour, enabled flags and production telemetry are not directly observable. Reverse engineering infers intent from implementation but cannot prove deployment prevalence. Some conclusions derive from reconstructed or community-sourced evidence rather than direct vendor statements.
Implications
Offensive security implications are concrete. Command execution, shell tools and plugin or MCP integration expand attack surface: compromised or malicious tools, poisoned MCP servers or crafted plugins could cause arbitrary filesystem or network actions if permission checks or sandboxing are bypassed. The pre-trust initialisation window shows an exploitable temporal attack surface where extensions can run before interactive trust is established. Hooks and deferred schema loading create complex, interacting paths that complicate threat modelling.

Subagent delegation and summary-only returns reduce observability and can enable stealthy parallel attacks if sidechains or summaries hide malicious activity. Session persistence choices - append-only transcripts but non-restored permissions - mean attackers cannot rely on implicit trust carry-over, but initialisation and extension loading remain high-risk phases.

Deployment context matters: perimeter-focused gateways reduce per-action checks but centralise risk; per-action classification gives fine-grained control but relies on layered mechanisms that may share failure modes. Attackers can exploit any layer that degrades under performance pressure. These observations underscore the need to scrutinise initialisation ordering, plugin/MCP trust, hook lifecycles, sandbox boundaries, and the interactions between automated classifiers and rule engines when assessing or attacking agentic systems.