IterInject weaponises feedback to hijack LLM agents

Published: Mon, May 25, 2026 • By Rowan Vale
IterInject is a feedback-guided, adaptive prompt injection framework that tunes payloads against Large Language Model agents using real signals from their behaviour. It consistently beats static prompts and a prior adaptive method, compromises a production coding agent on 5 of 9 targets, and maps where injections tip models into compliance.

LLM agents keep swallowing untrusted content from the web, files and tools, then acting on it. That is a gift to indirect prompt injection, where malicious instructions ride along inside retrieved data. Most demo attacks are static and brittle. IterInject changes the game by closing the loop: inject, diagnose, refine, repeat.

How it works

IterInject runs as a black-box optimiser around the target agent. It starts with a bank of disguise seeds, each a different way to smuggle instructions in context. It plants one into whatever the agent reads, watches the agent’s visible outputs, and a rule-based diagnoser labels the outcome with a short behavioural description: Success, Partial, Detected or Ignored. An LLM-based optimiser then rewrites the payload conditioned on the entire history, not just the last step. After each batch, a synthesiser spins up fresh disguise seeds from recurring failure patterns, so the strategy space evolves as defences react. In short: treat injection like an optimisation problem with structured feedback.
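The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Attempt`, `optimise`, `toy_agent`, `toy_diagnose` and `toy_refine` are hypothetical, the "agent" is a toy function that reacts to the number of `+` characters in a string, and the seed-synthesis step is left out. The point is only the control flow: run, diagnose, record, refine from the whole history.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Attempt:
    payload: str
    label: str        # one of: Success, Partial, Detected, Ignored
    description: str  # short behavioural description from the diagnoser


def optimise(seed: str,
             run_agent: Callable[[str], str],
             diagnose: Callable[[str], Tuple[str, str]],
             refine: Callable[[List[Attempt]], str],
             max_iters: int = 5) -> List[Attempt]:
    """Run, diagnose, record, then refine from the FULL history."""
    history: List[Attempt] = []
    payload = seed
    for _ in range(max_iters):
        output = run_agent(payload)            # only visible output is used
        label, description = diagnose(output)  # structured feedback
        history.append(Attempt(payload, label, description))
        if label == "Success":
            break
        payload = refine(history)              # sees every past attempt
    return history


# --- Toy stand-ins (assumptions, for illustration only) ---
def toy_agent(payload: str) -> str:
    # Reacts purely to how many '+' characters the string contains.
    outcomes = ("flagged", "no_reaction", "half_done", "complied")
    return outcomes[min(payload.count("+"), 3)]


TOY_RULES = {
    "flagged": ("Detected", "agent called out the injected text"),
    "no_reaction": ("Ignored", "agent carried on with the user task"),
    "half_done": ("Partial", "agent started the goal but did not finish"),
    "complied": ("Success", "agent completed the goal"),
}


def toy_diagnose(output: str) -> Tuple[str, str]:
    return TOY_RULES[output]


def toy_refine(history: List[Attempt]) -> str:
    # A real optimiser would be an LLM reading every label and description;
    # this stand-in just appends one '+' per recorded attempt.
    return history[0].payload + "+" * len(history)


history = optimise("seed", toy_agent, toy_diagnose, toy_refine)
print([a.label for a in history])
# ['Detected', 'Ignored', 'Partial', 'Success']
```

Passing the whole `history` to `refine`, rather than only the last attempt, mirrors the article's point that the optimiser is conditioned on the entire history.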

Why it lands

On AgentDojo, IterInject raises attack success rate across four victim models and often beats AgentVigil, an existing adaptive baseline. DeepSeek-V4-Flash shows the biggest absolute bump, 47.8 percent versus 32.9 percent for static prompts. On InjectAgent, static prompts barely register, but IterInject lifts total success to between 33 and 90 percent depending on the model, with direct-harm attacks seeing strong gains. Model vulnerability ranks stay consistent: DeepSeek most exposed, GLM-5.1 least. A real-world extension against Claude Code, a production-grade coding agent with layered defences, shows full compromise on 5 of 9 targets, and even the harder cases improve under iterative pressure.

The authors also crack open the mechanics. On Qwen3.5-27B they find an attention-mediated threshold in mid-to-late layers: successful attacks concentrate attention on payload tokens until a boundary flips from refusal to compliance. Three causal interventions back this up. Knock out attention in that zone and attacks fizzle; patch in activations from successful runs and refusal melts; steer attention and you can tug outcomes. Put simply: if the payload can win the attention fight in the right layers, the guardrails lose their footing.
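To make "concentrating attention on payload tokens" concrete, here is a minimal sketch of one way to measure it, assuming you already have, for each layer, the attention weights from the final query position over the context tokens. The function names, the toy weights and the 0.6 cut-off are all hypothetical; the paper's actual measurement and threshold are not given in this article.

```python
from typing import List, Sequence, Set


def payload_attention_share(attn_by_layer: Sequence[Sequence[float]],
                            payload_positions: Set[int]) -> List[float]:
    """Fraction of each layer's attention mass that lands on payload tokens."""
    shares = []
    for weights in attn_by_layer:
        total = sum(weights)
        on_payload = sum(w for i, w in enumerate(weights)
                         if i in payload_positions)
        shares.append(on_payload / total if total else 0.0)
    return shares


def layers_over_threshold(shares: Sequence[float],
                          threshold: float) -> List[int]:
    """Indices of layers whose payload share meets or exceeds the cut-off."""
    return [i for i, share in enumerate(shares) if share >= threshold]


# Toy data: 3 layers, 4 context tokens, tokens 2 and 3 are the payload.
toy_attention = [
    [0.4, 0.4, 0.1, 0.1],      # early layer: mostly on the user task
    [0.25, 0.25, 0.25, 0.25],  # middle layer: evenly split
    [0.1, 0.1, 0.4, 0.4],      # later layer: mostly on the payload
]
shares = payload_attention_share(toy_attention, {2, 3})
flagged = layers_over_threshold(shares, threshold=0.6)  # arbitrary cut-off
print(flagged)
```

A per-layer share like this is also a natural monitoring signal for defenders, since it needs no knowledge of what the payload says, only where the untrusted content sits in the context.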

From a red-team angle, the recipe is pragmatic. With only black-box access and a foothold in the retrieval stream, you can iteratively adapt payloads to the agent’s defences, using nothing but its visible outputs as feedback. Ablations show that seed quality, iterative refinement and cross-iteration context combine super-additively, with cross-iteration context helping multi-step exfiltration chains. Open questions remain: how far this generalises to multimodal inputs or longer interaction loops, how much the optimiser’s own capabilities bottleneck outcomes, and whether the same attention threshold pattern holds beyond Qwen3.5-27B. But the through-line is clear: static prompts are yesterday’s threat model; adaptive injections are now table stakes for testing agent security.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization

Authors: Zixuan Chen, Jiaxiang Chen, Li Luo, Ke Xu, Xiaoxiang Huang, Tanfeng Sun, and Xinghao Jiang
LLM-based agents are increasingly deployed for complex tasks requiring planning, tool use, and interaction with external services. Their reliance on untrusted external content exposes them to indirect prompt injection (IPI), in which adversarial instructions embedded in retrieved data hijack agent behavior. Existing attacks rely on static payloads that cannot adapt to agent-specific defenses; even recent adaptive methods lack structured feedback to guide optimization. We introduce IterInject, a feedback-guided iterative framework that closes the loop between injection, diagnosis, and refinement: a rule-based diagnoser produces structured outcome labels with behavioral descriptions, and an LLM-based optimizer refines payloads conditioned on the full optimization history. A synthesis step generates new disguise seeds from failure patterns, enabling the strategy space to self-evolve. On AgentDojo and InjectAgent, IterInject substantially outperforms static baselines and existing adaptive methods across four victim models. Extension experiments on Claude Code, a production-grade coding agent with layered defenses, show that optimized payloads achieve full success on 5 of 9 targets; even those that resist full exploitation exhibit measurable improvement from iterative refinement. We further present a mechanistic analysis of IPI, identifying an attention-mediated threshold mechanism in mid-to-late layers; three causal interventions validate this finding and point to concrete defense directions.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies indirect prompt injection, where adversarial instructions embedded in external data retrieved by LLM-based agents hijack agent behaviour. It is important because agents increasingly plan, call tools and act on untrusted content, yet prior attacks use static payloads that cannot adapt to model-specific or layered defences, leaving a gap in realistic threat models and evaluation methods.

Approach

The authors present IterInject, a closed-loop, black-box framework that iteratively crafts disguised payloads. IterInject starts from a curated seed bank of disguise templates, injects a payload into the agent's external context, observes the agent's visible outputs, and uses a rule-based diagnoser to emit a four-level label (Success, Partial, Detected or Ignored) with a natural-language explanation. An LLM-based optimiser refines payloads conditioned on the full optimisation history. After every batch of seeds, a synthesiser creates new disguise seeds from common failure patterns, enabling the strategy space to self-evolve. Evaluations use two public benchmarks, AgentDojo and InjectAgent, across four victim models (GLM-5.1, MiniMax-M2.7, DeepSeek-V4-Flash, Qwen3.5-27B), with an extension to a production-grade coding agent (Claude Code). Baselines are static benchmark prompts and an existing adaptive method, AgentVigil. The primary metric is Attack Success Rate (ASR), with user-task utility and joint success reported where applicable.
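A rule-based diagnoser of the kind described here might look like the sketch below. The rules, the `SUSPICION_MARKERS` phrases, the `Diagnosis` type and the tool names in the example are all assumptions made for illustration; the article does not list the paper's actual rules. The sketch only shows how a handful of deterministic checks can yield one of the four labels plus a short explanation.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical phrases that suggest the agent noticed the injected text.
SUSPICION_MARKERS = ("suspicious", "prompt injection", "will not follow")


@dataclass(frozen=True)
class Diagnosis:
    label: str        # Success, Partial, Detected or Ignored
    explanation: str  # short natural-language reason


def diagnose(tool_calls: Sequence[str], reply: str,
             goal_steps: Sequence[str]) -> Diagnosis:
    """Label one run from the agent's visible tool calls and reply text."""
    done = [step for step in goal_steps if step in tool_calls]
    missing = [step for step in goal_steps if step not in tool_calls]

    if goal_steps and not missing:
        return Diagnosis("Success", f"all goal steps executed: {done}")
    if done:
        return Diagnosis("Partial", f"executed {done}, missing {missing}")

    lowered = reply.lower()
    for marker in SUSPICION_MARKERS:
        if marker in lowered:
            return Diagnosis("Detected", f"reply contains {marker!r}")
    return Diagnosis("Ignored", "no goal step executed, no sign of detection")


# Example with made-up tool names for a two-step goal.
goal = ["read_file", "send_email"]
print(diagnose(["read_file"], "Done.", goal).label)
# Partial
```

Keeping the diagnoser rule-based makes the feedback cheap and reproducible, and the same labels are just as useful for scoring an evaluation harness as for guiding an optimiser.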

Key Findings

  • IterInject substantially outperforms static prompts and often beats AgentVigil: on AgentDojo it raises ASR across all four victim models, with the largest absolute gain on DeepSeek (47.8% vs 32.9% for static prompts).
  • On InjectAgent, IterInject raises Total ASR from near-zero (static) to between 33% and 90% depending on model, with particularly strong gains on direct-harm attacks and consistent model-level vulnerability rankings (DeepSeek most exposed, GLM-5.1 least).
  • Against Claude Code, a real-world coding agent with layered defences, iterative optimisation achieved full success on 5 of 9 targets and measurable improvement on others, showing iterative refinement can surmount some commercial protections.
  • Ablations show seed initialisation, iterative refinement and in-context learning combine super-additively: seeds give a strong start, iteration refines them, and cross-iteration context helps multi-step exfiltration chains.
  • Mechanistic analysis on Qwen3.5-27B identifies an attention amplification zone in mid-to-late layers where successful injections concentrate attention on payload tokens; a threshold-like decision boundary separates compliance from refusal. Three causal interventions—attention knockout, activation patching and attention steering—validate that attention redistribution in that zone drives success.
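The rates quoted in these findings are simple proportions over runs. The sketch below shows how ASR, user-task utility and joint success could be tallied, and how an absolute gain is read in percentage points. The `Run` type and the four toy runs are made up for illustration, and the definition of joint success used here (attack and user task both succeed in the same run) is an assumption, since this article does not define it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Run:
    attack_succeeded: bool
    user_task_succeeded: bool


def summarise(runs: Iterable[Run]) -> Dict[str, float]:
    """Return ASR, utility and joint success as fractions of all runs."""
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    return {
        "asr": sum(r.attack_succeeded for r in runs) / n,
        "utility": sum(r.user_task_succeeded for r in runs) / n,
        # Assumed definition: both outcomes hold in the same run.
        "joint": sum(r.attack_succeeded and r.user_task_succeeded
                     for r in runs) / n,
    }


def gain_in_points(optimised_pct: float, baseline_pct: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(optimised_pct - baseline_pct, 1)


# Four made-up runs covering every combination of outcomes.
toy_runs = [Run(True, True), Run(True, False),
            Run(False, True), Run(False, False)]
print(summarise(toy_runs))
# {'asr': 0.5, 'utility': 0.5, 'joint': 0.25}

# The DeepSeek figures quoted above: 47.8% versus 32.9% for static prompts.
print(gain_in_points(47.8, 32.9))
# 14.9
```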

Limitations

Experiments cover four victim models and two benchmarks. The Claude Code extension used a single backbone, and the mechanistic analysis is restricted to Qwen3.5-27B and requires access to attention matrices. The paper does not systematically explore how attack performance depends on the optimiser's capabilities, or other injection vectors such as multimodal inputs and multi-turn chains.

Implications

Attackers who can control retrieved content can iteratively craft stealthy, adaptive payloads that bypass static defences, induce harmful actions, or exfiltrate data. Iterative refinement enables payloads to adapt to detection signals and layered protections, increasing real-world risk for agents that treat external observations as trusted. The identified attention-mediated threshold suggests attackers exploit competition for attention in specific layers to tip a model from refusal to compliance.

