New to ShortSpan? We distil the AI-security research that matters into practitioner takeaways — edited by Ben Williams (NCC Group). Get the weekly email
// Analysis

MIRROR red-teams agentic RAG across text and images

Agents
MIRROR red-teams agentic RAG across text and images

MIRROR targets the messy reality of agentic retrieval-augmented generation (RAG): text poisoning, image injection, direct queries, and tool orchestration. It uses memory-guided search with a novelty gate to avoid copying benchmark attacks, posting high attack success across surfaces and exposing how single-surface methods crumble outside their comfort zone.

Agentic retrieval-augmented generation (RAG) stitched together from a Large Language Model (LLM), retrieval, tools and an orchestrator is the new enterprise toy. It is also a tangle of attack surfaces that most “prompt injection” demos barely touch. This paper takes a credible swing at the full mess and, mostly, lands it.

What they actually built

The system, MIRROR, automates red-teaming across four surfaces: text poisoning, image-based injection, direct queries and orchestrator manipulation. It couples an episodic memory of past wins with Monte Carlo tree search (MCTS). A Prior Network pulls k-nearest memories to bias operators and form a rejection set. A deterministic Novelty Gate then filters out any candidate that matches retrieved items under two normalisations, which matters because many published “attacks” are lazy copies of benchmark seeds.

After the gate, MIRROR runs PUCT-style MCTS: generate, mutate, simulate, and score with a judge model; then deterministically replay the best on the real target under a fixed victim-query budget. The team also ships ART-SafeBench, a large corpus used both to seed memory and to measure duplication: 41,815 in-package records and 41,991+ total across the four surfaces.

Results that aren’t just template recycling

On image poisoning, MIRROR hits 76% attack success rate (ASR), trouncing a text-overlay baseline at 52% and Steganography" target="_blank" rel="noopener" class="term-link">LSB steganography at 32%. On orchestrator attacks, which try to flip tool choices, it reaches 97% ASR at about half the query cost of a specialised baseline. Direct queries land at 31% ASR, the best of the bunch tested. For text poisoning, MIRROR enforces 0% exact-match duplication against the benchmark under its normalisations and still clocks 47% ASR. Several baselines fall apart when de-duplicated: 73–84% of their “successes” are exact copies, leaving a meagre 6–9% once you remove clones. Cross-surface variance is low too (coefficient of variation 0.47). Meanwhile, a much-hyped suffix optimiser hits 79% on text poisoning and 1% on direct queries. That is not robustness; that is overfitting with extra steps.

Does this matter? Yes, with caveats. The novelty guard only blocks exact matches under the stated normalisations, so smart paraphrases and visually similar images still slip. Simulator fidelity is a swing factor; if your offline simulator drifts from the live target, your numbers lie. The image-duplication check does not apply cleanly to image carriers. And the attacker-model API bill sits outside the reported victim-query budgets. Still, if you run multimodal, tool-using agents, the 97% orchestrator ASR should focus the mind: your “agent” is only as safe as the glue code picking its tools.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG

Authors: Inderjeet Singh, Andrés Murillo, Motoyoshi Sekiya, Yuki Unno, and Junichi Suga
Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks, and orchestrator-level tool manipulation. Existing red-teaming approaches are typically surface-specific and often recycle known attack templates; on text-poisoning benchmarks we measure 73-84% exact duplication. We present MIRROR, a unified cross-surface framework that performs memory-guided Monte Carlo tree search while conditioning candidate generation on retrieved context under an explicit novelty constraint. A deterministic Novelty Gate rejects any candidate matching the retrieval set under normalized comparison, allowing retrieval to inform search priors without enabling prompt copying. Across four attack surfaces on a multimodal agentic RAG target, MIRROR attains 76% ASR on image poisoning compared with 52% for baselines, 97% ASR on orchestrator attacks at half the query cost, and the lowest cross-surface variance (coefficient of variation 0.47). In contrast, specialized baselines collapse across surfaces: suffix optimization reaches 79% ASR on text poisoning but 1% on direct queries. We release ART-SafeBench with 41,815 in-package records and runtime adapters yielding 41,991+ total records across four surfaces.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies automated red-teaming for multimodal, agentic retrieval-augmented generation systems, where an orchestration model selects retrievals, tools and reasoning steps. Such systems broaden the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks and manipulation of orchestrator tool choices. Existing automated methods often recycle known attack templates and overstate novelty because many successful generations are verbatim reproductions of seed pools. The work addresses the need for a unified, cross-surface method that discovers distinct vulnerabilities while avoiding benchmark replay.

Approach

The authors introduce MIRROR, which couples an episodic memory of successful past attacks with memory-guided Monte Carlo tree search under an explicit deterministic novelty constraint. A Prior Network retrieves k-nearest-neighbour memories to form operator priors and a per-case rejection set. Candidate seeds are generated, must pass a Novelty Gate that rejects exact matches under two normalisations, and are explored with PUCT-style MCTS using mutation operators, simulator rollouts and a judge model. Successful candidates are deterministically replayed on real targets under a fixed victim-query budget for verification. The team also releases ART-SafeBench, a large benchmark corpus containing 41,815 in-package records and over 41,991 total records across four attack surfaces, used to seed MIRROR’s memory bank and to measure exact-duplication diagnostics.

Key Findings

  • MIRROR achieves strong cross-surface performance and low variance: coefficient of variation of ASR across surfaces is 0.47, the lowest among evaluated methods.
  • On image poisoning (B2) MIRROR attains 76% attack success rate compared with 52% for a text-overlay baseline and 32% for LSB steganography.
  • On orchestrator attacks (B4) MIRROR reaches 97% ASR with roughly half the query cost compared with a specialised tool-flip baseline.
  • On direct queries (B3) MIRROR achieves 31% ASR, the highest among evaluated methods for that surface.
  • On text poisoning (B1) MIRROR enforces 0% exact-match duplication with the benchmark under its normalisations and reports an ASR of 47% when duplicates are excluded. By contrast, several baselines show 73 to 84% exact duplication of prior attacks, so their nominal ASR collapses to 6 to 9% after deduplication diagnostics.
  • Specialised text-only optimisers transfer poorly: a suffix optimisation method reaches 79% on text poisoning but only 1% on direct queries, illustrating strong surface dependence of existing methods.

Limitations

The Novelty Gate enforces deterministic exact-match non-duplication only under stated normalisations, so semantically equivalent paraphrases may still pass. Simulator fidelity and corpus-target alignment affect transfer; when the simulator differs from the real target, simulator-only success can overestimate real-world effectiveness. The image-surface duplication metric is not applicable to image carriers under the text-based normalisations. Total attacker-model API cost is not included in the victim-query budgets reported.

Implications

From an offensive-security perspective, MIRROR demonstrates a practical way to discover novel, cross-surface vulnerabilities at scale: attackers could automate memory-guided search to craft image-based payloads, manipulate tool-selection logic or design direct-query prompts that evade simple pattern patches. Its efficiency and deterministic novelty enforcement make it harder to defeat by fixed blacklist patches alone, since the method actively seeks distinct variants and validates them on real targets. The release of ART-SafeBench further lowers the barrier to reproduce and extend such attacks, increasing the importance of cross-surface detection and rigorous retrieval isolation, content validation and provenance controls.

// Similar research

Related Research

Get the weekly digest

The few AI-security papers that matter, with the practitioner takeaway. No spam.