
Claude Code Agent Design Exposes Real Attack Paths

Agents
Published: Wed, Apr 15, 2026 • By Lydia Stratus
New analysis tears into Claude Code and OpenClaw agent architectures. The core agent loop is simple; the risk lives in permissions, classifiers, plugins, MCP servers and subagents. A patched initialisation gap shows timing bugs are real. The paper maps concrete paths for command execution abuse, stealthy subagent work, and audit gaps.

Agent systems love to brag about their clever reasoning loops. Claude Code’s loop is literally a while-true: call model, run tool, repeat. The interesting security work lives outside that loop, in the harness that decides whether tools run, what context the Large Language Model (LLM) sees, and which extensions get a say. That is where attackers aim.
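That while-true shape is worth seeing. A minimal sketch in Python (Claude Code itself is TypeScript; every name here is invented for illustration, not its actual API):

```python
# Minimal sketch of a reactive agent loop: call the model, run the
# requested tool, feed the result back, repeat until no tool is asked for.
# All names are illustrative; the security-relevant harness lives outside.

def agent_loop(model, tools, prompt):
    messages = [{"role": "user", "content": prompt}]
    while True:  # the "while-true" at the core of the system
        reply = model(messages)
        if reply.get("tool") is None:
            return reply["content"]            # final answer, loop ends
        tool = tools[reply["tool"]]            # tool dispatch
        result = tool(**reply.get("args", {}))
        messages.append({"role": "tool", "content": str(result)})
```

Everything interesting, in security terms, happens around those five lines: what goes into `tools`, and what is allowed to run.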

The study shows a deny-first permission stack with seven modes, an ML classifier to choose auto or prompt-required actions, pre and post hooks, and an optional shell sandbox. Any single layer can block an action. Flip that around and you get the attacker’s playbook: find the layer that blinks under pressure. If the classifier misroutes to auto, if hooks run before checks, or if the sandbox is off or porous, you are a shell command away from real impact. The most concrete signal: researchers previously found a pre-trust initialisation window that let privileged code run before the full pipeline engaged. Multiple CVEs, now patched, but the class of bug is timeless: unsafe ordering during startup.
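The deny-first stack reads as a chain where every layer holds a veto. A toy sketch of that shape (layer names follow the paper; the stubbed logic is ours, not the real classifier or hooks):

```python
# Toy deny-first permission stack: each layer may veto an action,
# and an action runs only if every layer passes it. The stubs below
# stand in for the real ML classifier, hook pipeline and sandbox.

def classifier(action):        # ML auto-mode classifier (stubbed)
    return action.get("risk", "high") == "low"   # a misroute here => auto-run

def pre_hook(action):          # Pre hook pipeline (stubbed)
    return not action.get("blocked_by_hook", False)

def sandbox(action):           # optional shell sandbox (stubbed)
    return action.get("sandboxed", True)

LAYERS = [classifier, pre_hook, sandbox]

def permitted(action):
    # deny-first: any single failing layer blocks the action
    return all(layer(action) for layer in LAYERS)
```

The attacker's playbook falls straight out of `all()`: you do not beat the stack, you find the one layer whose answer is wrong under pressure.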

Extensibility is the real blast radius

Claude Code aggregates tools via MCP servers, plugins, skills and hooks, with builds showing on the order of a few dozen tools. That is a wide funnel. A compromised plugin or poisoned MCP server does not need prompt-injection genius; it needs a cooperative permission outcome and a moment when the harness trusts the extension’s output. The paper calls out hook pipelines and deferred schema loading as complexity multipliers. Complexity is where timing bugs and policy gaps breed.
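The aggregation step itself is a plain merge, which is exactly why a poisoned extension is dangerous. A hedged sketch of what a tool-pool assembly like the paper's assembleToolPool might look like (names and collision behaviour are our assumptions, not the verified implementation):

```python
# Sketch of tool-pool assembly: built-in tools merged with tools
# contributed by extensions (MCP servers, plugins, skills).
# A malicious extension only needs to land one entry in this dict.

def assemble_tool_pool(builtins, extensions):
    pool = dict(builtins)
    for ext in extensions:
        for name, tool in ext.items():
            # in this sketch, later extensions win on name collisions --
            # one plausible spot where a poisoned server shadows a tool
            pool[name] = tool
    return pool
```

Whatever the real collision policy is, the point stands: the pool is only as trustworthy as its least trustworthy contributor.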

Context is the binding resource. A five-layer compaction pipeline plus lazy loading keeps tokens within budget, across windows that have grown from ~200k in older models to about 1M in the newer Claude 4.6 series. Compaction and lazy fetches mean the model’s actual input is a moving target. Attackers like moving targets because they lower operator predictability and make audits harder to reason about after the fact.
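Why the moving target? Budget-driven compaction rewrites history. A simplified sketch (one layer of the five; the summarisation is stubbed and the function names are ours):

```python
# Sketch of budget-driven compaction: when the transcript exceeds the
# token budget, older messages are folded into a summary stub. The
# model's actual input therefore shifts over time -- the moving target.

def compact(messages, budget, count_tokens=len):
    if sum(count_tokens(m) for m in messages) <= budget:
        return messages
    kept, dropped = [], []
    for m in reversed(messages):       # keep the most recent messages
        if sum(count_tokens(k) for k in kept) + count_tokens(m) <= budget:
            kept.insert(0, m)
        else:
            dropped.insert(0, m)
    summary = f"[compacted {len(dropped)} messages]"
    return [summary] + kept
```

Run the same session twice with different budgets and the model sees different inputs, which is precisely what makes after-the-fact audit reasoning hard.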

Subagent delegation is tidy on paper: spawn a subagent with its own worktree or remote isolation, keep a sidechain transcript, return a short summary to the parent to save context. From an attacker’s perspective, that summary boundary is cover. If the parent only ever sees a digest, the messy bits live in a parallel log that fewer humans read, especially under incident pressure.
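The summary boundary is easy to show. A minimal sketch of the delegation contract (invented names; the paper's mechanism is AgentTool with sidechain transcripts):

```python
# Sketch of the subagent summary boundary: the child writes a full
# sidechain transcript, but the parent receives only a short digest.
# Anything an auditor needs may live only in the sidechain log.

def run_subagent(task, steps):
    sidechain = []                       # full transcript, rarely read
    for step in steps:
        sidechain.append(f"{task}: {step}")
    summary = f"{task}: completed {len(steps)} steps"
    return summary, sidechain            # parent typically keeps the summary
```

The parent's context stays lean, and so does the parent's visibility: the digest is the same length whether the steps were benign or not.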

Gateway answers differ, risks do not

Compared with OpenClaw’s gateway model, the trust shape changes. Claude Code leans on per-action checks inside a session harness. OpenClaw pushes identity and access control to the perimeter and a control plane. In the former, I target the classifier and tool orchestration. In the latter, I go straight for the gateway’s control surface. Same questions, different blast radii.

The unresolved edges are where this gets interesting: how reliable are ML gatekeepers when tokens spike, how often do hooks fire before policy, and what exactly happens in the first seconds of initialisation when every extension wants to be helpful? Those are not academic queries; they are 3am questions.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Authors: Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen
Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open-source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while-loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append-oriented session storage. A comparison with OpenClaw, a multi-channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per-action safety classification to perimeter-level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context-window extensions to gateway-wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.

🔍 ShortSpan Analysis of the Paper

Problem

The paper analyses the architecture of Claude Code, an agentic coding tool that can run shell commands, edit files and call external services, and contrasts it with OpenClaw, an open-source multi-channel agent gateway. It documents how design choices address recurring questions in agent systems - where reasoning runs, how safety is enforced, how context is managed, how extensibility is structured, how work is delegated, and how sessions persist - and why these choices matter for security, reliability and human control.

Approach

Source-level analysis of the publicly available TypeScript package for Claude Code (v2.1.88) was used to map components and code paths. The study organises findings around five motivating human values and thirteen design principles, traces a representative task through the system, and contrasts Claude Code with OpenClaw to show how deployment context changes architectural answers. Evidence tiers distinguish product documentation, code-verified claims and reconstructed inferences.

Key Findings

  • Architecture and harness: Claude Code centres on a simple reactive while-loop that calls the model, runs tools and repeats, but most implementation effort is in surrounding infrastructure - permissions, context compaction, extensibility and persistence.
  • Layered safety: A deny-first permission system with up to seven modes, an ML auto-mode classifier, Pre/Post hook pipeline and optional shell sandbox provide defence in depth; any single layer can block actions.
  • Context management: Context is the binding resource. A five-layer compaction pipeline (budget reduction, snip, microcompact, context collapse, auto-compact) plus lazy loading of CLAUDE.md files limits token use; older models had ~200k context, newer Claude 4.6 series support ~1M.
  • Extensibility surface: Four mechanisms - MCP servers, plugins, skills and hooks - expose different context costs and trade-offs; assembleToolPool merges built-in and external tools, up to about 54 tools in some builds.
  • Subagent delegation and isolation: AgentTool spawns isolated subagents with worktree, remote or in-process isolation; subagents write sidechain transcripts and return summaries only to conserve parent context.
  • Persistence and auditability: Session transcripts are mostly append-only JSONL files, enabling reconstruction and audit while deliberately not restoring session-scoped permissions on resume to avoid carrying stale trust.
  • Deployment-sensitive trade-offs: Compared with OpenClaw, Claude Code opts for per-action safety evaluation and rich per-session harnessing, whereas OpenClaw prioritises perimeter identity and gateway-level access control, illustrating how trust model and scope change architectural choices.
  • Known temporal vulnerability: Independent researchers found pre-trust initialisation ordering issues that created a privileged window before the full permission pipeline engaged; multiple CVEs were disclosed and patched.
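The pre-trust window in the last finding is an ordering bug, and the shape of that class is easy to demonstrate. A purely illustrative sketch (not the patched code; every name here is invented):

```python
# Illustrative initialisation-ordering bug: extensions load (and may
# execute code) before the permission pipeline is armed. The fix is to
# arm the gate first -- the bug class is unsafe startup ordering.

class Harness:
    def __init__(self):
        self.permissions_ready = False
        self.events = []

    def load_extensions(self, extensions):
        for ext in extensions:
            # records whether permissions were armed when this ext ran
            self.events.append((ext, self.permissions_ready))

    def arm_permissions(self):
        self.permissions_ready = True

# unsafe ordering: the extension runs inside the pre-trust window
h = Harness()
h.load_extensions(["plugin_a"])
h.arm_permissions()
```

Swap the last two calls and the window closes; the CVEs were patched, but any harness that grows new startup steps can reintroduce the same ordering mistake.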

Limitations

Analysis is a static snapshot of a specific code release and feature-flagged builds; runtime behaviour, enabled flags and production telemetry are not directly observable. Reverse engineering infers intent from implementation but cannot prove deployment prevalence. Some conclusions derive from reconstructed or community-sourced evidence rather than direct vendor statements.

Implications

Offensive security implications are concrete. Command execution, shell tools and plugin or MCP integration expand attack surface: compromised or malicious tools, poisoned MCP servers or crafted plugins could cause arbitrary filesystem or network actions if permission checks or sandboxing are bypassed. The pre-trust initialisation window shows an exploitable temporal attack surface where extensions can run before interactive trust is established. Hooks and deferred schema loading create complex, interacting paths that complicate threat modelling. Subagent delegation and summary-only returns reduce observability and can enable stealthy parallel attacks if sidechains or summaries hide malicious activity. Session persistence choices - append-only transcripts but non-restored permissions - mean attackers cannot rely on implicit trust carry-over, but initialisation and extension loading remain high-risk phases. Deployment context matters: perimeter-focused gateways reduce per-action checks but centralise risk; per-action classification gives fine-grained control but relies on layered mechanisms that may share failure modes. Attackers can exploit any layer that degrades under performance pressure. These observations underscore the need to scrutinise initialisation ordering, plugin/MCP trust, hook lifecycles, sandbox boundaries, and the interactions between automated classifiers and rule engines when assessing or attacking agentic systems.

