MIRROR red-teams agentic RAG across text and images
MIRROR targets the messy reality of agentic retrieval-augmented generation (RAG): text poisoning, image injection, direct queries, and tool orchestration. It uses memory-guided search with a novelty gate to avoid copying benchmark attacks, posting high attack success across surfaces and exposing how single-surface methods crumble outside their comfort zone.
Agentic RAG, stitched together from a large language model (LLM), retrieval, tools and an orchestrator, is the new enterprise toy. It is also a tangle of attack surfaces that most “prompt injection” demos barely touch. This paper takes a credible swing at the full mess and, mostly, lands it.
What they actually built
The system, MIRROR, automates red-teaming across four surfaces: text poisoning, image-based injection, direct queries and orchestrator manipulation. It couples an episodic memory of past wins with Monte Carlo tree search (MCTS). A Prior Network pulls k-nearest memories to bias operators and form a rejection set. A deterministic Novelty Gate then filters out any candidate that matches retrieved items under two normalisations, which matters because many published “attacks” are lazy copies of benchmark seeds.
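To make the gate concrete, here is a minimal sketch of an exact-match filter under two normalisations. The article does not say which two normalisations MIRROR uses, so `norm_whitespace_case` and `norm_strip_punct` are assumptions chosen for illustration, as are the function names and the placeholder strings.

```python
import string
import unicodedata


def norm_whitespace_case(text: str) -> str:
    """Assumed normalisation 1: Unicode NFKC, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def norm_strip_punct(text: str) -> str:
    """Assumed normalisation 2: normalisation 1 plus ASCII punctuation removal."""
    text = norm_whitespace_case(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# Hypothetical pair of normalisers; the paper's actual pair may differ.
NORMALISERS = (norm_whitespace_case, norm_strip_punct)


def passes_novelty_gate(candidate: str, rejection_set: list[str]) -> bool:
    """Reject a candidate that exactly matches any retrieved item
    under either normalisation; accept it otherwise."""
    for normalise in NORMALISERS:
        seen = {normalise(item) for item in rejection_set}
        if normalise(candidate) in seen:
            return False
    return True


# Neutral placeholder strings standing in for retrieved memory items.
rejection_set = ["Please summarise the attached quarterly report."]
near_copy = "  PLEASE summarise the attached, quarterly report "
paraphrase = "Give me a short summary of the quarterly report."

print(passes_novelty_gate(near_copy, rejection_set))   # blocked by normalisation 2
print(passes_novelty_gate(paraphrase, rejection_set))  # passes the gate
```

The check is deterministic and cheap, but it only catches copies that collapse to the same string. The paraphrase passes, which is the gap the paper's limitations section acknowledges.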
After the gate, MIRROR runs PUCT-style MCTS: generate, mutate, simulate, and score with a judge model; then deterministically replay the best candidates on the real target under a fixed victim-query budget. The team also ships ART-SafeBench, a large corpus used both to seed memory and to measure duplication: 41,815 in-package records and 41,991+ total across the four surfaces.
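For readers unfamiliar with PUCT, the selection rule can be sketched as follows. This uses the common AlphaZero-style form, mean value plus a prior-weighted exploration bonus; the article only says "PUCT-style", so the exact formula, the `c_puct` constant and the operator names here are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    """Statistics for one mutation operator at a search node."""
    prior: float              # P(s, a), e.g. from memory-derived operator priors
    visits: int = 0           # N(s, a)
    total_value: float = 0.0  # sum of judge scores from rollouts

    @property
    def mean_value(self) -> float:
        return self.total_value / self.visits if self.visits else 0.0


def puct_score(edge: Edge, parent_visits: int, c_puct: float = 1.5) -> float:
    """Q(s, a) + c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a))."""
    exploration = c_puct * edge.prior * math.sqrt(parent_visits) / (1 + edge.visits)
    return edge.mean_value + exploration


def select_operator(edges: dict[str, Edge], c_puct: float = 1.5) -> str:
    """Pick the operator with the highest PUCT score."""
    parent_visits = max(sum(e.visits for e in edges.values()), 1)
    return max(edges, key=lambda name: puct_score(edges[name], parent_visits, c_puct))


# Hypothetical operators and statistics, for illustration only.
edges = {
    "paraphrase": Edge(prior=0.6, visits=10, total_value=3.0),
    "encode": Edge(prior=0.3, visits=1, total_value=0.0),
    "split": Edge(prior=0.1, visits=0),
}
chosen = select_operator(edges)
print(chosen)
```

In this toy state the rarely tried `encode` operator wins over the well-explored `paraphrase`, because the exploration bonus shrinks as an edge accumulates visits.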
Results that aren’t just template recycling
On image poisoning, MIRROR hits 76% attack success rate (ASR), trouncing a text-overlay baseline at 52% and LSB steganography at 32%. On orchestrator attacks, which try to flip tool choices, it reaches 97% ASR at about half the query cost of a specialised baseline. Direct queries land at 31% ASR, the best of the bunch tested. For text poisoning, MIRROR enforces 0% exact-match duplication against the benchmark under its normalisations and still clocks 47% ASR. Several baselines fall apart when de-duplicated: 73–84% of their “successes” are exact copies, leaving a meagre 6–9% once you remove clones. Cross-surface variance is low too (coefficient of variation 0.47). Meanwhile, a much-hyped suffix optimiser hits 79% on text poisoning and 1% on direct queries. That is not robustness; that is overfitting with extra steps.
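The coefficient of variation can be checked from the four MIRROR ASR figures quoted above. The sketch below assumes the sample standard deviation (n - 1 denominator) divided by the mean; the article does not state which estimator the paper uses, and the population version gives roughly 0.41 instead.

```python
from statistics import mean, stdev

# MIRROR's reported ASR per surface, as quoted in this article.
asr_by_surface = {
    "text poisoning": 0.47,
    "image poisoning": 0.76,
    "direct queries": 0.31,
    "orchestrator": 0.97,
}


def coefficient_of_variation(values) -> float:
    """Sample standard deviation divided by the mean (assumed estimator)."""
    values = list(values)
    return stdev(values) / mean(values)


cv = coefficient_of_variation(asr_by_surface.values())
print(f"{cv:.2f}")
```

A lower value means success rates are more even across surfaces. A method that scores 79% on one surface and 1% on another would produce a much larger figure.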
Does this matter? Yes, with caveats. The novelty guard only blocks exact matches under the stated normalisations, so smart paraphrases and visually similar images still slip. Simulator fidelity is a swing factor; if your offline simulator drifts from the live target, your numbers lie. The duplication metric relies on text-based normalisations, so it does not apply to image carriers. And the attacker-model API bill sits outside the reported victim-query budgets. Still, if you run multimodal, tool-using agents, the 97% orchestrator ASR should focus the mind: your “agent” is only as safe as the glue code picking its tools.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies automated red-teaming for multimodal, agentic retrieval-augmented generation systems, where an orchestration model selects retrievals, tools and reasoning steps. Such systems broaden the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks and manipulation of orchestrator tool choices. Existing automated methods often recycle known attack templates and overstate novelty because many successful generations are verbatim reproductions of seed pools. The work addresses the need for a unified, cross-surface method that discovers distinct vulnerabilities while avoiding benchmark replay.
Approach
The authors introduce MIRROR, which couples an episodic memory of successful past attacks with memory-guided Monte Carlo tree search under an explicit deterministic novelty constraint. A Prior Network retrieves k-nearest-neighbour memories to form operator priors and a per-case rejection set. Candidate seeds are generated, must pass a Novelty Gate that rejects exact matches under two normalisations, and are explored with PUCT-style MCTS using mutation operators, simulator rollouts and a judge model. Successful candidates are deterministically replayed on real targets under a fixed victim-query budget for verification. The team also releases ART-SafeBench, a large benchmark corpus containing 41,815 in-package records and over 41,991 total records across four attack surfaces, used to seed MIRROR’s memory bank and to measure exact-duplication diagnostics.
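The retrieval step can be illustrated with a small sketch: fetch the k nearest memory records, turn their operator labels into priors, and keep their texts as the per-case rejection set. The article does not describe the Prior Network's internals, so the cosine similarity, the additive smoothing, the `MemoryRecord` fields and the toy embeddings below are all assumptions.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    """Hypothetical shape of one stored successful attack."""
    embedding: list[float]
    operator: str
    text: str


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: list[float], memory: list[MemoryRecord], k: int) -> list[MemoryRecord]:
    """Return the k records most similar to the query embedding."""
    ranked = sorted(memory, key=lambda r: cosine(query, r.embedding), reverse=True)
    return ranked[:k]


def operator_priors(neighbours: list[MemoryRecord], operators: list[str],
                    smoothing: float = 1.0) -> dict[str, float]:
    """Smoothed frequency of each operator among the neighbours."""
    counts = Counter(r.operator for r in neighbours)
    total = len(neighbours) + smoothing * len(operators)
    return {op: (counts[op] + smoothing) / total for op in operators}


# Toy memory with 2-D embeddings and placeholder texts.
memory = [
    MemoryRecord([1.0, 0.1], "paraphrase", "a"),
    MemoryRecord([0.9, 0.2], "paraphrase", "b"),
    MemoryRecord([0.0, 1.0], "encode", "c"),
]
neighbours = retrieve([1.0, 0.0], memory, k=2)
priors = operator_priors(neighbours, ["paraphrase", "encode"])
rejection_set = [r.text for r in neighbours]
print(priors, rejection_set)
```

The same neighbours serve two purposes: they bias the search towards operators that worked on similar cases, and they define what the Novelty Gate must not reproduce.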
Key Findings
- MIRROR achieves strong cross-surface performance and low variance: coefficient of variation of ASR across surfaces is 0.47, the lowest among evaluated methods.
- On image poisoning (B2) MIRROR attains 76% attack success rate compared with 52% for a text-overlay baseline and 32% for LSB steganography.
- On orchestrator attacks (B4) MIRROR reaches 97% ASR with roughly half the query cost compared with a specialised tool-flip baseline.
- On direct queries (B3) MIRROR achieves 31% ASR, the highest among evaluated methods for that surface.
- On text poisoning (B1) MIRROR enforces 0% exact-match duplication with the benchmark under its normalisations and reports an ASR of 47% when duplicates are excluded. By contrast, several baselines show 73 to 84% exact duplication of prior attacks, so their nominal ASR collapses to 6 to 9% after deduplication diagnostics.
- Specialised text-only optimisers transfer poorly: a suffix optimisation method reaches 79% on text poisoning but only 1% on direct queries, illustrating strong surface dependence of existing methods.
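The deduplication effect on text poisoning can be illustrated with a toy calculation. The sketch assumes the diagnostic simply discards successes that exactly match a benchmark record; the function name, the strings and the counts are made up for illustration and are not figures from the paper.

```python
def deduplication_report(successes: list[str], benchmark: set[str], attempts: int) -> dict[str, float]:
    """Compare nominal ASR with ASR after dropping exact benchmark copies."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    novel = [s for s in successes if s not in benchmark]
    duplicate_fraction = 1 - len(novel) / len(successes) if successes else 0.0
    return {
        "nominal_asr": len(successes) / attempts,
        "duplicate_fraction": duplicate_fraction,
        "deduplicated_asr": len(novel) / attempts,
    }


# Toy data: 10 attempts, 5 successes, 4 of which are verbatim benchmark seeds.
benchmark = {"seed-1", "seed-2", "seed-3", "seed-4"}
successes = ["seed-1", "seed-2", "seed-3", "seed-4", "new-variant"]
report = deduplication_report(successes, benchmark, attempts=10)
print(report)
```

Here a nominal 50% ASR with 80% duplication leaves 10% once copies are removed, the same kind of collapse the baselines show.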
Limitations
The Novelty Gate enforces deterministic exact-match non-duplication only under stated normalisations, so semantically equivalent paraphrases may still pass. Simulator fidelity and corpus-target alignment affect transfer; when the simulator differs from the real target, simulator-only success can overestimate real-world effectiveness. The image-surface duplication metric is not applicable to image carriers under the text-based normalisations. Total attacker-model API cost is not included in the victim-query budgets reported.
Implications
From an offensive-security perspective, MIRROR demonstrates a practical way to discover novel, cross-surface vulnerabilities at scale: attackers could automate memory-guided search to craft image-based payloads, manipulate tool-selection logic or design direct-query prompts that evade simple pattern patches. Its efficiency and deterministic novelty enforcement make it harder to defeat by fixed blacklist patches alone, since the method actively seeks distinct variants and validates them on real targets. The release of ART-SafeBench further lowers the barrier to reproduce and extend such attacks, increasing the importance of cross-surface detection and rigorous retrieval isolation, content validation and provenance controls.