Research Reveals How Adversaries Break LLM Defences

Attacks

Published: Thu, Oct 19, 2023 • By Theo Solander

Research Reveals How Adversaries Break LLM Defences

A new survey maps how attackers bypass safety controls in Large Language Models (LLMs). It finds jailbreaks, prompt injection and multimodal tricks can defeat alignment, leak data and manipulate workflows. External filters help only marginally. The paper urges threat modelling, red teaming and layered defences as LLMs move into critical systems.

Large Language Model (LLM) deployments are no longer academic projects; they sit inside customer chatbots, search augmentation and multi-agent automation. The survey under review collects recent attacks that show a simple fact: alignment and safety tuning do not equal security. Adversaries routinely craft prompts, visual inputs or poisoned updates that persuade models to reveal information, ignore constraints or issue dangerous instructions.

The catalogue of techniques reads like a map of weak joints. Textual jailbreaks and prompt injection scripts work against safety-aligned models, automated generation of hostile prompts often beats hand-crafted ones, and attacks that target embeddings or retrieval pipelines let an attacker influence model context without touching the core model weights. Multimodal attacks add new surfaces: images can carry instructions that bypass text filters, and federated or multi-agent setups let tainted participants propagate vulnerabilities across a system.

Historical pattern and pragmatic lesson

There is a familiar rhythm here. When a new platform hits scale it attracts both creative use and creative abuse. The early web taught defenders that filters lag and attackers automate at scale; constraints that look secure in lab testing crumble under adaptive adversaries. That history suggests two things. First, expect a cat-and-mouse phase where fixes are brittle. Second, the only durable response is layered, not single-shot: detection, containment, and the ability to recover matter more than any one perfect control.

The survey is careful: it does not claim universal fixes exist. External safeguards such as moderation endpoints and output filters reduce some attacks but cover only the attack space they have seen. Adversarial training and red teaming improve robustness but generalise unevenly. Attacks against federated learning and agent ecosystems show vulnerabilities propagate through complex workflows, so hardening one component while leaving others exposed gives a false sense of security.

For practitioners the implications are straightforward and slightly uncomfortable. Assume alignment failures will happen. Treat prompt content and retrieved context as part of the threat surface. Test under realistic attacker models: white box where internal details are known, black box where only inputs and outputs are visible, and hybrid modes. Invest in continuous red teaming that includes automated prompt generation and multimodal vectors. Audit logs and provenance data are vital for post-incident analysis and for spotting novel attack patterns.

Operational measures matter. Apply defence in depth: input validation, sanity checks on retrieved documents, output filtering with human-in-the-loop escalation, rate limits and provenance markers on sensitive responses. Run adversarial training targeted to known failure modes but accept its limits. Finally, build monitoring and rollback capability so you can isolate and revert compromised agents or model versions without taking down critical services.

The survey gives security teams a taxonomy and a short list of practical tests. It also offers a sober reminder: as LLMs spread into complex systems, security becomes a systems problem. Past cycles of surprise and patching teach a pragmatic posture: prepare to detect, contain and recover rather than chase an elusive perfect alignment.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Authors: Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu-Ghazaleh

Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of `jailbreak' attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).

🔍 ShortSpan Analysis of the Paper

Problem

Large Language Models are increasingly integrated into complex systems, raising security concerns as adversarial attacks can defeat safety alignment and cause harmful outputs, privacy leaks or manipulation. This public survey examines the rapidly evolving field of adversarial attacks on LLMs, including jailbreaking, prompt injections and more, across unimodal, multimodal and complex workflows, with the aim of taxonomy, threat modelling and guiding tests and mitigations for the cyber security community.

Approach

The authors offer a systematic review and typology of adversarial techniques against LLMs, covering textual only and multimodal attacks, as well as attacks on federated learning and multi agent systems. They describe threat models, attacker access levels (white box, black box, partial), injection sources and end goals, and summarise defensive strategies. The survey draws on contemporary works from 2023 and 2024, including manual and automated jailbreaks, prompt injection methods, prompt leakage, and attacks against augmented and retrieval based systems. It also discusses external safeguards such as content moderation and input output filtering, and highlights the need for deeper evaluation, auditing and red teaming. The paper provides a structured review, a taxonomy of adversarial concepts and practical resources including ACL 24 slides to help researchers and practitioners align on current risks and assessment approaches.

Key Findings

Jailbreak and prompt injection attacks can bypass alignment, generating restricted or harmful outputs even for safety tuned and RLHF aligned models
Ad hoc jailbreak prompts achieve high attack success rates across models such as ChatGPT, GPT 4, ChatGLM, Dolly and Vicuna, with larger models often exhibiting expanded latent capabilities that enable advanced attacks
Automated jailbreak generation methods outperform manual prompts and can transfer to real world chatbots such as Bard and Bing, with results including PoC prompts and greedy suffix techniques
External safeguards like moderation endpoints and output filtering provide only marginal improvements, constrained by limited training data coverage and the evolving attack space
Prompt leakage and injection attacks extend to multimodal and retrieval based systems, introducing new threat surfaces such as visual prompt injections and adversarial embedding space attacks
Attacks against federated learning and multi agent setups show that LLM vulnerabilities propagate through complex systems, with defensive schemes such as m K rum offering partial mitigation
Defence strategies range from auditing with ARCA, adversarial training and red teaming to input and output filtering; however none offer universal protection, underscoring the need for safety capability parity and layered defence

Limitations

The survey captures a fast moving field with uneven coverage across models and tasks. Reported results are often model and prompt dependent, and defenses frequently fail to generalise. External safeguards depend on training data breadth and may be circumvented by adaptive attacks. The literature also notes mismatched generalisation and insufficient safety training data in relation to cutting edge models, making universal robustness elusive and leaving open questions about the efficacy of adversarial training as a universal remedy.

Why It Matters

The review emphasises real world risks in LLM security, including privacy leakage, misalignment in complex AI workflows and vulnerabilities in multi agent and federated settings. It advocates ongoing validation and monitoring, robust auditing and red teaming, cross modality alignment, and defence in depth. By providing a clear taxonomy and actionable resources, the paper supports threat modelling and testing for the cybersecurity community as LLMs continue to scale and integrate into critical systems.

Attribution Original paper on arXiv