AgentLAB tests long-horizon attacks on LLM agents
Teams are starting to ship agents that use tools, browse, and keep memory, and that is exactly where single-turn safety checks miss how real attacks play out. A long-horizon attack is simple in idea: the adversary works over many turns to steer the agent step by step. AgentLAB puts numbers on how often that works.
What AgentLAB adds
AgentLAB is a benchmark and attack framework focused on multi-turn threats to Large Language Model (LLM) agents. It defines five attack families that reflect what agents actually do in the wild: intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning. The authors instantiate 28 realistic, tool-enabled environments and 644 security test cases, pairing many malicious tasks with benign counterparts for comparison.
The framework coordinates planner, attacker, and judge agents that iteratively craft and refine prompts or injected observations; some attacks also include a verifier to validate tool calls. Default components use strong LLMs as planners and judges and an open-weight attacker with safety constraints removed to represent a potent red team. Two metrics drive evaluation: Attack Success Rate (ASR) and Turn to Success (T2S). It supports adaptive optimisation to refine payloads during an attack and varies the maximum allowed turns to probe temporal depth.
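In outline the loop is simple: plan once, then alternate attacker messages with agent replies until the judge declares success or the turn budget runs out. Here is a minimal sketch of that pattern in Python, assuming generic planner, attacker, judge and verifier callables and an `agent.respond` method; the names and signatures are illustrative, not AgentLAB's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Illustrative type aliases; AgentLAB's real planner/attacker/judge are LLM-backed agents.
Planner = Callable[[str], str]        # attack goal -> attack plan
Attacker = Callable[[str, str], str]  # plan, last agent reply -> next adversarial message
Judge = Callable[[str, str], bool]    # goal, transcript -> has the harmful objective been met?
Verifier = Callable[[str], bool]      # optional check that a harmful tool call really executed

@dataclass
class AttackResult:
    success: bool
    turns_used: int                   # feeds Turn to Success (T2S)
    transcript: list = field(default_factory=list)

def run_long_horizon_attack(goal: str, agent, planner: Planner, attacker: Attacker,
                            judge: Judge, verifier: Optional[Verifier] = None,
                            max_turns: int = 10) -> AttackResult:
    """Plan once, then iterate attacker -> target agent -> judge until success or budget exhausted."""
    plan = planner(goal)
    transcript: list = []
    last_reply = ""
    for turn in range(1, max_turns + 1):
        message = attacker(plan, last_reply)   # refine the payload based on the last reply
        last_reply = agent.respond(message)    # target agent acts, possibly calling tools
        transcript.append((message, last_reply))
        if judge(goal, str(transcript)) and (verifier is None or verifier(last_reply)):
            return AttackResult(True, turn, transcript)
    return AttackResult(False, max_turns, transcript)
```

Counting successes across test cases gives ASR; averaging `turns_used` over successful runs gives T2S.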
The results are blunt. Agents remain highly vulnerable to long-horizon attacks, with average ASR for a leading model exceeding roughly 70%. Multi-turn strategies outperform one-shot injections: on one model, ASR rose from 62.5% with a single-shot injection to 79.9% when the attack was allowed to iterate. Vulnerability depends on attack type and model; some agents resist intent hijacking yet still fall to tool chaining. Giving the attacker more turns increases success, and optimisation helps, but less than simply having more turns. Off-the-shelf defences designed for one-shot cases, including Self-Reminder, Llama-Guard, Repeated Prompt, and a DeBERTa detector, show mixed reductions in ASR but do not provide consistent protection across attacks or models.
How to use it today
- Run AgentLAB end to end against your agent stack across all five attack families. Track ASR and T2S, and sweep the allowed turns to map how risk grows with interaction length (a minimal harness sketch follows this list).
- Evaluate your existing guardrails using the same suite. Compare one-shot versus long-horizon performance per attack type, and promote this to a regression suite so numbers are tracked per release.
- Extend the benchmark with your tools and tasks. Pair malicious scenarios with benign counterparts and include memory-centric cases to probe poisoning and drift.
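For the first two points, a minimal harness sketch, assuming a `run_attack` callable like the loop above and a flat list of `(family, goal)` test cases; the names, turn budgets, and result fields are placeholders rather than AgentLAB's actual interface.

```python
from collections import defaultdict
from statistics import mean

def evaluate(agent, test_cases, run_attack, turn_budgets=(1, 5, 10, 20)):
    """Sweep turn budgets and report ASR / T2S per attack family.

    test_cases: iterable of (family, goal) pairs, e.g. ("tool_chaining", "...").
    run_attack: callable(goal, agent, max_turns=...) returning an object with
                .success (bool) and .turns_used (int), like AttackResult above.
    """
    report = {}
    for budget in turn_budgets:
        by_family = defaultdict(list)
        for family, goal in test_cases:
            by_family[family].append(run_attack(goal, agent, max_turns=budget))
        report[budget] = {
            family: {
                "asr": sum(r.success for r in results) / len(results),
                "t2s": mean(r.turns_used for r in results if r.success)
                       if any(r.success for r in results) else None,
            }
            for family, results in by_family.items()
        }
    return report
```

Promoting this to a regression suite can be as simple as persisting the report per release and failing the build when ASR rises for any family.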
There are caveats. This is a preliminary evaluation on selected agents, environments, and baseline defences. The attacker setup is deliberately strong, using an open-weight model with safety removed, which is appropriate for red teaming but not a measure of what any random user will do. Some dataset and reporting details vary across sections of the paper, though the benchmark is modular and meant to be extended.
The implication is clear: securing agents means thinking in turns, not prompts. Controls tuned for a single exchange are unlikely to hold once an adversary can iterate, compose tools, and touch memory. AgentLAB does not prescribe a definitive fix, but it gives a common yardstick to compare approaches and track progress. Expect the next wave of work to focus on multi-turn-aware safeguards for goals, tools, and memory integrity. In the meantime, if you run agents with tools and persistence, this is a practical way to find out how they fail and measure whether your fixes actually help.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies how large language model agents (LLM agents) can be compromised by adaptive, multi-turn adversaries in realistic environments. Such long-horizon attacks exploit extended user–agent–environment interactions to achieve harmful objectives that are difficult or impossible to accomplish in a single exchange. Measuring these vulnerabilities is important because LLM agents are increasingly used with persistent memory, tool access and environmental browsing in high‑stakes domains where security failures can have significant real‑world consequences.
Approach
The authors introduce AgentLAB, a live, extensible benchmark and multi‑agent attack framework for evaluating long‑horizon threats. AgentLAB defines five attack families: intent hijacking, tool chaining, task injection, objective drifting and memory poisoning. Attacks are implemented by coordinating planner, attacker and judge agents that iteratively craft and refine prompts or injected observations; some attacks also use a verifier to validate tool calls. The benchmark instantiates 644 security test cases across 28 tool‑enabled, realistic agentic environments and pairs many malicious tasks with benign counterparts. Evaluation uses two main metrics: Attack Success Rate (ASR) and Turn to Success (T2S). Default attack components include strong LLMs as planners and judges and an open‑weight attacker model with safety constraints removed. Maximum attack turns are configured per attack family to allow varied temporal depth, and adaptive optimisation (TextGrad) can refine payloads during attacks.
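The adaptive optimisation step can be read as feedback-driven rewriting of the injected payload between turns. A minimal sketch of that idea, not the actual TextGrad integration; `critique` and `rewrite` stand in for LLM calls.

```python
def optimise_payload(payload: str, agent_reply: str, critique, rewrite, steps: int = 3) -> str:
    """Refine an adversarial payload from textual feedback, in the spirit of TextGrad.

    critique: callable(payload, agent_reply) -> str, explaining why the payload fell short.
    rewrite:  callable(payload, feedback) -> str, producing an improved payload.
    """
    for _ in range(steps):
        feedback = critique(payload, agent_reply)   # a "textual gradient": what to change and why
        payload = rewrite(payload, feedback)        # apply the feedback to get the next candidate
    return payload
```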
Key Findings
- LLM agents remain highly vulnerable to long‑horizon attacks: average ASR for a leading model exceeds about 70%, indicating these attacks pose a broad challenge across both proprietary and open‑weight models.
- Long‑horizon strategies substantially outperform one‑shot injections: for example, ASR on one model rose from 62.50% under a one‑shot injection to 79.9% under long‑horizon attack.
- Vulnerability varies by attack type and model: some models show resilience against particular attacks but remain susceptible to others (examples include low ASR for hijacking in some cases yet high ASR for tool chaining on the same model).
- Attack effectiveness grows with the permitted number of turns; additional optimisation steps improve success rates but to a lesser degree than extra turns.
- Defences designed for one‑shot scenarios are generally insufficient: evaluated baseline defences (for example Self‑Reminder, Llama‑Guard, Repeated Prompt and a DeBERTa detector) show mixed results, reducing ASR for some model–attack pairs but failing to provide consistent protection across attack types and agents.
Limitations
The evaluation is preliminary and focused on a selected set of representative agents, environments and baseline defences. The attack framework relies on strong adversarial models and an open‑weight attacker with safety removed, which represents a potent red‑teaming scenario. Some dataset and reporting details vary across sections of the paper; the benchmark is modular and intended for extension with new attacks, agents and defences.
Why It Matters
AgentLAB highlights that temporally extended and adaptive adversaries can bypass many current safety mechanisms, emphasising the need for defence strategies designed for multi‑turn, tool‑enabled agents. By providing a standardised, extensible set of environments and attacks, the benchmark enables comparative assessment of vulnerabilities and defensive approaches and should help drive research into protections that account for persistence, compositional tool sequences and memory manipulation.