Semantic visor blocks LLM agent prompt injections

Published: Tue, Apr 28, 2026 • By Natalie Kestrel
AgentVisor treats the task agent as untrusted and audits every tool call using Suitability, Taint and Integrity checks. It reports a 0.65% attack success rate with only 1.45% utility loss, plus a one-shot self-correction that keeps tasks moving. Strong numbers, but the visor’s blind spots hint at new red-team angles.

LLM agents keep getting burned by prompt injection because they mix untrusted text with privileged tools and then pretend nothing bad will happen. AgentVisor takes the opposite stance: assume the agent is untrusted, intercept every tool call, and force it through a security gate before anything risky executes.

What they built

The framework virtualises semantics the way an operating system virtualises hardware. The agent is a Guest; a trusted Visor sits between the Guest and its tools. The Visor never sees raw external context, only trusted inputs and a sanitised execution history. Every proposed tool call hits a trap–audit–recover loop built on STI checks: Suitability checks whether the tool fits the system instruction, Taint checks whether the goal aligns with the user request and prior legitimate goals, and Integrity checks that arguments target authorised entities and respect constraints. If a check fails, the Visor throws a structured Semantic Exception explaining which stage failed and what is allowed. The agent gets exactly one shot to correct itself; the revised call then runs.
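To make the control flow concrete, here is a minimal Python sketch of the trap–audit–recover loop. The class and function names, and the toy predicates standing in for the real LLM-based checks, are my assumptions, not AgentVisor's published API.

```python
# Minimal sketch of the Visor's trap-audit-recover loop. Names and the toy
# predicates are illustrative assumptions, not AgentVisor's actual interface.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    goal: str
    args: dict

@dataclass
class SemanticException(Exception):
    stage: str       # which STI check failed: suitability / taint / integrity
    rationale: str   # why the Visor blocked the call
    allowed: str     # constructive feedback: what the Guest may do instead

def audit(call: ToolCall, instructed_tools: set, user_goal: str, authorised: set) -> None:
    """Apply the STI checks in order; raise on the first failure."""
    # Suitability: is the tool permitted under the system instruction?
    if call.tool not in instructed_tools:
        raise SemanticException("suitability", f"{call.tool} is out of scope",
                                f"use one of {sorted(instructed_tools)}")
    # Taint: does the proposed goal align with the user request and prior
    # legitimate goals? (A real Visor would judge this over sanitised history.)
    if user_goal.lower() not in call.goal.lower():
        raise SemanticException("taint", "goal diverges from user intent",
                                f"pursue only: {user_goal}")
    # Integrity: do the arguments reference only authorised entities?
    if not set(call.args.values()) <= authorised:
        raise SemanticException("integrity", "unauthorised argument entities",
                                f"restrict arguments to {sorted(authorised)}")

def visor_execute(guest_correct, call, instructed_tools, user_goal, authorised):
    """Trap the call, audit it, and allow exactly one self-correction."""
    try:
        audit(call, instructed_tools, user_goal, authorised)
    except SemanticException as exc:
        call = guest_correct(call, exc)                       # one-shot recovery
        audit(call, instructed_tools, user_goal, authorised)  # re-audit; a second
                                                              # failure is a hard block
    return f"executed {call.tool}({call.args})"
```

Capping recovery at a single round is deliberate; as the numbers below show, extra rounds add latency for negligible gains.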

On paper, it lands hard. Across direct-injection tests (OpenPromptInjection, 4,900 cases) and indirect interactive tasks (AgentDojo, 629 cases), they report 0.65% attack success with an average 1.45% utility drop versus no defence. Kill Suitability and direct-attack success jumps to 38.95%. Kill Taint and indirect-attack success hits 13.33%. Integrity matters too, with moderate rises (for example 8.89% in indirect tests). The self-correction piece rescues throughput: block-only tanks utility under attack, while one-shot recovery restores it to 85.56% in direct and 66.67% in indirect tests. Extra correction rounds add latency (up to 1.90×) for negligible gains. Latency overhead is about 1.4× on benign runs, 2.32× under direct attack and 1.71× under indirect. They also show 0% success against recursive adaptive injections with 86.85% utility, where a naive visor craters utility to 53.95%. Backbones vary in utility, but the STI checks hold attack success near zero across models.

Where it might crack

The visor’s clean-room view is both the trick and the bet. It trusts the sanitised history and its own constraint mapping. I’d try to poison those seams. Can I influence what gets recorded as “authorised entities,” then pass Integrity by design? Can I shape goals to mimic user intent closely enough to slip past Taint while smuggling a side objective inside a “suitable” tool? The Semantic Exception is also an oracle: it tells me which stage blocked me and how to course-correct. One shot is still feedback. I’d iterate across turns, or across sessions, to learn the boundary and encode exfil in arguments that look legitimate.
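To make the oracle worry concrete, here is a hypothetical probing loop. Every name in it is an assumption for illustration, and note the paper reports recursive adaptive attacks of this flavour failing (0% success) against the full STI stack.

```python
# Hypothetical red-team sketch: treat the Semantic Exception as a feedback
# oracle and steer payload mutations by which STI stage fired. Illustrative
# only; the paper reports adaptive attacks like this being neutralised.
def probe_boundary(send_payload, mutate, payload, max_turns=20):
    """Iterate injected payloads across turns (or sessions) to map the boundary."""
    for _ in range(max_turns):
        result = send_payload(payload)        # attacker-controlled injection attempt
        if result.get("executed"):
            return payload                    # slipped through the Visor
        stage = result.get("failed_stage")    # leaked by the structured exception
        # e.g. re-align goal wording if "taint" fired; swap in authorised-looking
        # entities if "integrity" fired; pick a permitted tool if "suitability" fired
        payload = mutate(payload, stage)
    return None                               # boundary held within the budget
```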

There are practical choke points. Long sessions strain taint analysis when context windows pinch. The paper focuses on text; multimodal injections are out of scope. The overhead is real, though not outrageous. The open question is enterprise scale: mapping granular privileges to messy tool surfaces and keeping those maps tamper-proof. If the visor’s sanitised fields drift from reality, who notices first: the defender, or the attacker?

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

Authors: Zonghao Ying, Haozheng Wang, Jiangfan Liu, Quanchen Zou, Aishan Liu, Jian Yang, Yaodong Yang, and Xianglong Liu
Large Language Model (LLM) agents are increasingly used to automate complex workflows, but integrating untrusted external data with privileged execution exposes them to severe security risks, particularly direct and indirect prompt injection. Existing defenses face significant challenges in balancing security with utility, often encountering a trade-off where rigorous protection leads to over-defense, or where subtle indirect injections bypass detection. Drawing inspiration from operating system virtualization, we propose AgentVisor, a novel defense framework that enforces semantic privilege separation. AgentVisor treats the target agent as an untrusted guest and intercepts tool calls via a trusted semantic visor. Central to our approach is a rigorous audit protocol grounded in classic OS security primitives, designed to systematically mitigate both direct and indirect injection attacks. Furthermore, we introduce a one-shot self-correction mechanism that transforms security violations into constructive feedback, enabling agents to recover from attacks. Extensive experiments show that AgentVisor reduces the attack success rate to 0.65%, achieving this strong defense while incurring only a 1.45% average decrease in utility relative to the No Defense scenario, demonstrating superior performance compared to existing defense methods.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies prompt injection threats to tool-using large language model agents, focusing on both direct injections (malicious user queries that override system instructions) and indirect injections (malicious content embedded in external context retrieved by the agent). These attacks can cause agents to perform unauthorised tool calls or exfiltrate data. Existing defences either block functionality and reduce utility or are brittle against subtle, context-driven attacks, leaving a gap for practical, principled protection that enforces least privilege and information-flow constraints while preserving task completion.

Approach

The authors introduce AgentVisor, a semantic virtualisation framework that treats the task agent as an untrusted Guest and interposes a trusted Visor to mediate every proposed tool call via a trap–audit–recover loop. The Visor is architecturally blind to raw external context and relies on trusted inputs plus a sanitised execution history. Auditing follows the STI protocol: Suitability checks whether the tool is appropriate under the system instruction, Taint assesses whether the proposed goal is aligned with the user request and prior legitimate goals, and Integrity verifies that tool arguments match authorised entities and constraints. When a check fails, the Visor injects a structured Semantic Exception that specifies the failed STI stage, rationale, constraints and an allowed objective; the Guest then performs a single one-shot self-correction and the revised call is executed. Experiments use GPT-4o and GLM-4.7 as agent backbones and Gemini-2.5-Flash for the Visor, evaluating on direct-injection benchmarks (OpenPromptInjection, 4,900 cases) and indirect interactive environments (AgentDojo, 629 cases). Metrics are Benign Utility, Attack Success Rate and Utility under Attack.
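The paper describes the exception's fields (failed stage, rationale, constraints, allowed objective) but this summary gives no concrete schema, so the payload and correction prompt below are a hedged sketch of what the Guest might receive.

```python
# Hypothetical shape of a structured Semantic Exception and the single
# correction round. Field names track the paper's description; the exact
# schema and prompt wording are assumptions.
import json

semantic_exception = {
    "failed_stage": "taint",
    "rationale": "Proposed goal 'forward inbox contents to an external address' "
                 "does not align with the user request or prior legitimate goals.",
    "constraints": ["act only on entities named in the user request",
                    "no data leaves the user's own accounts"],
    "allowed_objective": "summarise today's unread email for the user",
}

# One-shot self-correction: the exception is injected back into the Guest's
# context as constructive feedback, and the revised call is re-audited.
correction_prompt = (
    "Your last tool call was blocked by the Visor.\n"
    + json.dumps(semantic_exception, indent=2)
    + "\nPropose exactly one revised tool call that satisfies the allowed objective."
)
print(correction_prompt)
```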

Key Findings

  • Strong overall defence: AgentVisor reduces attack success dramatically (reported aggregate attack success rate 0.65%) while incurring a small average utility decrease (1.45% relative to No Defense).
  • Component importance: removing Suitability raised direct-injection ASR to 38.95%, removing Taint raised indirect-injection ASR to 13.33%, and removing Integrity produced moderate ASR increases (for example 8.89% in indirect settings), showing each STI layer addresses distinct attack modes.
  • Self-correction restores utility: a block-only policy collapses Utility under Attack (near 0% for direct and 13.33% for indirect), whereas the one-shot semantic fault recovery restored UA to 85.56% (direct) and 66.67% (indirect) in reported tests.
  • One-shot correction is efficient: additional correction rounds yield negligible utility gains (indirect UA improves by 1.29% at a second round) but incur large latency costs (1.45× for two rounds, 1.90× for three rounds), supporting N=1 as the pragmatic choice.
  • Model-agnostic security and model-dependent utility: STI achieves near-zero ASRs across backbones; stronger agent models yield higher UA (for example, GPT-4o UA 77.11% versus GLM-4.7 UA 64.27%).
  • Adaptive attacks: AgentVisor neutralised recursive adaptive injections in experiments (reported 0% ASR and 86.85% UA), whereas a naive Visor lost utility (example UA 53.95%).
  • Performance costs: benign latency overhead ≈1.4×, direct-attack latency ≈2.32×, indirect-attack latency ≈1.71×.

Limitations

The framework adds inference latency and token cost. Long interaction histories may challenge taint analysis due to context-window limits. The current design focuses on textual prompt injections and does not address multimodal visual or audio injection attacks.

Implications

For offensive security, the work clarifies the practical locus of exploitation: attackers gain leverage by embedding instructions in user queries or external content that aim to change goals or substitute arguments. AgentVisor raises the bar by forcing attacks to evade the Suitability, Taint and Integrity checks and to survive both a sanitised-history view and a structured exception response. Adaptive strategies that target the defence layer are feasible but experimentally harder to pull off against this design, suggesting attackers would need more sophisticated, context-aligned payloads, or ways to corrupt the sanitised fields or the exception-handling process, to regain effectiveness.

