Healthcare deployment cages LLM agents with zero trust
Agents
Autonomous agents are turning up in production with powers most developers would deny a junior engineer: shell access, file system reads, database queries and free rein on the network. In healthcare, that is a fast path to exposing Protected Health Information (PHI). Recent red teaming has shown that these agents follow non-owner instructions, leak sensitive data and fall for indirect prompt injection. The question is not whether this is risky, but whether anyone has shipped a defensible design.
What they built
This paper describes a live deployment at a healthcare technology company that put nine agents behind a zero trust wall for 90 days. The threat model spans six domains: credentials, execution abuse, network egress, prompt integrity, database access and fleet drift. The defence has four layers. First, kernel-level isolation using gVisor on Kubernetes keeps workloads fenced. Second, a credential proxy sidecar means containers never see raw secrets and all calls are policy-enforced. Third, per-agent egress allowlists restrict outbound traffic. Fourth, a prompt integrity framework carries structured, cryptographically verifiable metadata and marks untrusted content so the model can treat it accordingly.
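The first and third layers are plain Kubernetes configuration. As a minimal sketch, assuming illustrative names and an example IP (the paper's actual manifests are in its open source release, not reproduced here), the gVisor runtime and a per-agent egress allowlist might look like:

```yaml
# Hypothetical RuntimeClass selecting the gVisor user-space kernel (runsc).
# Pods opt in with `runtimeClassName: gvisor` in their spec.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
# Hypothetical per-agent egress allowlist: once any Egress policy selects
# the pod, all other outbound traffic is denied by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: triage-agent-egress   # illustrative agent name
  namespace: agents
spec:
  podSelector:
    matchLabels:
      app: triage-agent
  policyTypes:
    - Egress
  egress:
    - ports:                  # DNS lookups only
        - protocol: UDP
          port: 53
    - to:
        - ipBlock:
            cidr: 203.0.113.10/32   # single approved API endpoint (example IP)
      ports:
        - protocol: TCP
          port: 443
```

The deny-by-default shape is what breaks exfiltration chains: any destination not pinned here, including an attacker-controlled callback host, is simply unreachable from the agent pod.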
They added an automated security audit agent that continuously scans for and remediates configuration issues, and hardened the fleet across three virtual machine image iterations. Over the run, the audit agent surfaced four high-severity findings, all fixed within a day. By the end, six of nine virtual machines had no findings above low severity.
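The audit loop itself is simple to picture. A minimal sketch, assuming hypothetical check names and configuration keys (the paper's actual rule set lives in its released tooling):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    check: str
    severity: str  # "LOW" | "MEDIUM" | "HIGH"

def audit(config: dict) -> list[Finding]:
    """Run illustrative checks against one VM's reported configuration.

    Each check mirrors a threat-model domain: isolation, credentials,
    egress. Keys and thresholds here are assumptions, not the paper's.
    """
    findings = []
    if not config.get("gvisor_enabled", False):
        findings.append(Finding("workload-isolation", "HIGH"))
    if config.get("raw_secrets_in_env", False):
        findings.append(Finding("credential-exposure", "HIGH"))
    if config.get("egress", "open") != "allowlist":
        findings.append(Finding("network-egress", "MEDIUM"))
    return findings

def clean(config: dict) -> bool:
    """A VM counts as clean when no finding exceeds LOW severity."""
    return all(f.severity == "LOW" for f in audit(config))
```

Running such checks on a schedule is what turns one-off hardening into a posture that resists fleet drift, the sixth domain in the threat model.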
On performance, gVisor introduced 2 to 5 ms extra latency on TCP connections and slowed sequential file reads by 20 to 40 percent. That cost was swamped by model inference times of 500 to 3000 ms, so throughput barely moved.
Mapping to published attack patterns, the stack directly mitigated nine out of eleven cases, partially covered one and left one out of scope. The infrastructure controls did the heavy lifting against execution abuse, credential leakage and data exfiltration. The prompt integrity layer reduced trivial spoofing and some indirect injection, but it cannot guarantee clean separation because instructions and data still sit in the same Large Language Model (LLM) context.
Operationally, the credential proxy stopped secrets sprawl and enforced rate limits and destination allowlists at the edge. Egress controls broke common exfiltration chains, but they needed constant care as DNS and content delivery network addresses rotated, and developers asked for exceptions. The audit agent delivered value quickly, but it also became a privileged target, which the team acknowledged and scoped tightly.
So what for security teams
This is not a grand breakthrough. It is the application of familiar zero trust and Kubernetes hygiene to a new class of untrusted workload. That is exactly what most organisations need. Treat agents as potentially hostile processes even when you own the code. Isolate them, proxy their credentials, pin their egress and be explicit about what counts as trusted input.
The only notably new piece is the prompt integrity framework. It is sensible engineering, and it helps, but it still relies on the model following policy. Prompt injection and identity spoofing are reduced, not solved. If your risk tolerance depends on perfect adherence, you will be disappointed.
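To make the shape of that layer concrete: the paper describes a cryptographically verifiable trusted-metadata envelope plus explicit labelling of untrusted content. A minimal sketch, assuming an HMAC construction and key handling that are this article's illustration, not necessarily the paper's exact scheme:

```python
import hashlib
import hmac
import json

# Hypothetical shared key, held by the orchestrator rather than by agents.
ENVELOPE_KEY = b"example-only-key"

def seal(metadata: dict) -> dict:
    """Attach an HMAC tag so downstream agents can verify that trusted
    metadata (sender, role, timestamp) was not forged in transit."""
    payload = json.dumps(metadata, sort_keys=True).encode()
    tag = hmac.new(ENVELOPE_KEY, payload, hashlib.sha256).hexdigest()
    return {"metadata": metadata, "tag": tag}

def verify(envelope: dict) -> bool:
    """Recompute the tag and compare in constant time."""
    payload = json.dumps(envelope["metadata"], sort_keys=True).encode()
    expected = hmac.new(ENVELOPE_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(envelope["tag"], expected)

def wrap_untrusted(text: str) -> str:
    """Label external content so the model is told to treat it as data,
    not instructions. The model's compliance is best-effort only."""
    return f"<untrusted>\n{text}\n</untrusted>"
```

Note the asymmetry: `verify` gives a hard guarantee about who said something, while `wrap_untrusted` only asks the model to behave. That is exactly why this layer reduces, rather than eliminates, injection risk.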
Commercially, the implications are straightforward. If you are in a regulated environment, this blueprint maps cleanly to HIPAA Security Rule expectations and shows a viable path from a soft baseline to a hardened fleet. Outside healthcare, the same pattern generalises to any environment where agents touch sensitive systems. The controls and tooling are open source, which lowers adoption friction. The open question remains at the model layer: stronger guarantees for prompt integrity will require capabilities that sit beyond infrastructure. Until then, build the wall, watch the egress, and assume the agent can be turned against you.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare
🔍 ShortSpan Analysis of the Paper
Problem
This paper examines security risks of autonomous AI agents deployed in healthcare, where agents have capabilities such as shell execution, file access, database queries, HTTP requests and multi‑party communication. Empirical red teaming has shown agents can follow non‑owner instructions, disclose sensitive data, be identity‑spoofed, propagate unsafe practices between agents and be corrupted via indirect prompt injection. In environments handling Protected Health Information, each failure mode can become a HIPAA breach. The work seeks a practical, deployable architecture to mitigate these threats.
Approach
The authors produced a six‑domain threat model for agentic AI in healthcare: credential exposure, execution capability abuse, network egress exfiltration, prompt integrity failures, database access risks and fleet configuration drift. They implemented a four‑layer defence in depth for nine production agents over a 90‑day deployment: (1) kernel‑level workload isolation using gVisor on Kubernetes; (2) credential proxy sidecars so agent containers never hold raw secrets; (3) per‑agent network egress allowlists enforced via Kubernetes NetworkPolicy; and (4) a prompt integrity framework comprising a cryptographically structured trusted metadata envelope and explicit untrusted content labelling. They also deployed an automated fleet security audit agent to continuously scan for and remediate issues, and performed progressive hardening across three VM image generations.
Key Findings
- Operational deployment detected and remediated four HIGH severity findings within one day via the automated audit agent; six of nine VMs finished with no findings above LOW severity.
- The four defence layers map to empirical attack patterns: infrastructure controls (gVisor, credential proxy, egress policies) provide robust mitigation for execution abuse, credential leakage and exfiltration; the prompt integrity framework reduces but cannot eliminate prompt injection and identity spoofing risks.
- gVisor added modest overhead: roughly 2–5 ms extra TCP connection latency and 20–40 percent slower sequential file reads, which was negligible compared with model inference latency of 500–3000 ms.
- Credential proxy sidecars prevented raw API keys from residing in agent containers and enforced request policies such as rate limiting and destination allowlisting, addressing scattered credential exposures found in the baseline fleet.
- Network egress allowlisting blocked outbound connections to unauthorised endpoints, breaking common exfiltration chains; operational challenges include DNS/IP rotation and exception management for development.
- The prompt integrity layer preserved useful inter‑agent collaboration while reducing susceptibility to trivial spoofing and indirect injection; however it remains the most brittle layer since it depends on the model following policy.
- Overall, the architecture provided direct mitigation for nine of eleven documented attack cases, partially mitigated one and identified one case as out of scope.
Limitations
The prompt integrity framework cannot guarantee prevention of prompt injection because instructions and data share the same model context. The audit agent itself is a high‑value privileged target and, although scoped and logged, introduces recursive risk. Egress allowlists require operational upkeep to handle DNS and CDN IP changes. One documented attack remained outside the deployment scope.
Why It Matters
The paper demonstrates a practical, zero‑trust security architecture that maps controls to HIPAA Security Rule provisions and empirical attack patterns, showing a viable path from an unhardened baseline to a hardened production fleet. The combination of infrastructure‑level controls and application‑level mitigations materially reduces the likelihood and impact of PHI exposure. All configurations, audit tooling and the prompt integrity framework are released open source to enable replication and further evaluation across healthcare deployments.