
Researchers Expose Transferable Black-Box Prompt Injection Attacks

Published: Wed, Sep 10, 2025 • By Natalie Kestrel
New research demonstrates a practical black-box direct prompt injection method that crafts adversarial prompts using activation signals and token-level MCMC. The technique transfers across multiple LLMs and unseen tasks, achieving high attack success and producing natural-looking prompts. Operators must treat prompt text as an active attack surface, not just benign input.

This paper pulls a neat trick out of the adversary toolbox and holds it up to the light: you can craft highly effective, transferable prompt attacks without peeking at model gradients. The authors train an activation-guided energy model on a surrogate LLM and then run token-level MCMC to iteratively refine prompts. Result: roughly 49.6% attack success across five mainstream models and 36.6% success on unseen tasks.

Why this matters: these are black-box, gradient-free attacks that produce natural-sounding prompts. That means prompt-based features — system messages, user-inserted templates, plugin inputs — are now credible attack vectors even when defenders assume model internals are safe. Simple text classifiers may miss them because the adversarial prompts intentionally preserve naturalness.

There are useful caveats. The method leans on a surrogate model to harvest activation patterns, so effectiveness can vary with surrogate choice and task mix. The paper also flags a trade-off between naturalness and potency, and it does not exhaustively test against robust text-filtering defenses. Still, the numbers are high enough to change threat models.

Security takeaway: defending only at the model weight or API layer is not enough. You must validate and adversarially test the prompt channel itself, and assume attackers will mutate inputs until they hit activation patterns that the model accepts.

Actionable checks teams can run:

  • Run cross-model red-team tests using diverse surrogates.
  • Mutate prompts with token-level edits and measure transfer ASR.
  • Monitor activation vectors for unusual clustering or low-energy patterns.
  • Add prompt-paraphrase and semantic validation.
  • Evaluate text classifiers with adversarially generated natural prompts.
  • Enforce least-trust input handling and rate limits for risky prompt sources.
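One concrete way to run the mutation-and-measure check is a small harness that perturbs prompts at token level and records per-model success rates. The sketch below is illustrative only: the victim clients, the judge_success heuristic and the payload marker are placeholders to wire into your own stack, not tooling from the paper.

```python
# Minimal sketch of a cross-model transfer test: mutate a prompt at token
# level and record which victim models comply. Victim clients and the
# judge_success() heuristic are placeholders, not the paper's tooling.
import random
from typing import Callable, Dict, List

def token_mutate(prompt: str, n_edits: int = 1, vocab: List[str] = None) -> str:
    """Swap n_edits random tokens for random vocabulary words (crude mutation)."""
    vocab = vocab or ["please", "ignore", "system", "instead", "now", "output"]
    words = prompt.split()
    for _ in range(n_edits):
        words[random.randrange(len(words))] = random.choice(vocab)
    return " ".join(words)

def judge_success(response: str, payload_marker: str) -> bool:
    """Placeholder judge: did the injected payload leak into the response?"""
    return payload_marker.lower() in response.lower()

def transfer_asr(seed_prompts: List[str],
                 victims: Dict[str, Callable[[str], str]],
                 payload_marker: str,
                 mutations_per_seed: int = 20) -> Dict[str, float]:
    """Per-victim attack success rate over mutated variants of each seed."""
    results = {}
    for name, ask in victims.items():
        hits = total = 0
        for seed in seed_prompts:
            for _ in range(mutations_per_seed):
                total += 1
                if judge_success(ask(token_mutate(seed)), payload_marker):
                    hits += 1
        results[name] = hits / max(total, 1)
    return results

# Usage sketch: victims maps model names to callables returning each model's reply.
# asr = transfer_asr(["Summarise this text: ..."], {"model-a": my_client_a}, "PWNED")
```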

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling

Authors: Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, Xiaobing Pei, and Jing Wang
Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier of execution and high potential damage. To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activations-guided prompt injection attack framework. We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts. Guided by the trained EBM, we employ the token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks. Experimental results demonstrate our superior cross-model transferability, achieving 49.6% attack success rate (ASR) across five mainstream LLMs and 34.6% improvement over human-crafted prompts, and maintaining 36.6% ASR on unseen task scenarios. Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.

🔍 ShortSpan Analysis of the Paper

Problem

Direct Prompt Injection (DPI) attacks pose a critical security threat to large language models due to their low barrier to execution and potential for substantial damage. Existing white-box and gray-box methods are often impractical, and black-box approaches suffer from limited transferability. The paper investigates a transferable, gradient-free black-box attack that can exploit LLM prompts across multiple models and unseen tasks, highlighting the difficulty of defending against prompt-based threats that operate outside model internals.

Approach

The authors propose an activation-guided prompt injection framework. They build an energy-based model (EBM) from the activations of a surrogate model to evaluate the quality of adversarial prompts. Guided by the EBM, they apply token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimise prompts, enabling gradient-free black-box attacks. The method begins with a template dataset of prompts decomposed into prefix, infix and suffix components. Data augmentation decouples these components to maximise diversity, creating a template set of prefixes, infixes and suffixes. Activations are collected from the surrogate model during prompt evaluation, constructing an activation dataset used to train the EBM. A two-layer multilayer perceptron serves as the activation classifier that realises the energy function. For generation, a seed prompt is iteratively mutated: a token is replaced using a masked language model, the activations are recomputed, and the energy is used to accept or reject the new candidate. The process continues for a fixed number of iterations, yielding an adversarial prompt with high attack effectiveness. The Tensor Trust attack dataset supplies attack samples, and several tasks are used for training and testing to evaluate transferability. Open-source and closed-source victim models include Qwen and Llama variants and GPT-4o mini. Evaluation uses Attack Success Rate (ASR) and transfer ASR, with baselines including human-crafted prompts and prior white-box and gray-box methods. The study also analyses naturalness via perplexity and interprets activation-based patterns.
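To make the pipeline concrete, here is a minimal sketch of the activation-guided MCMC loop under stated assumptions: a Hugging Face surrogate model, a masked language model proposing token replacements, and an already-trained two-layer MLP energy head. Model names, the mean-pooling choice and all hyperparameters are illustrative, not the paper's exact configuration.

```python
# Sketch of activation-guided MCMC prompt refinement (illustrative, not the
# paper's exact setup). Assumes the energy head's weights were trained
# elsewhere on labelled attack/benign activations.
import math
import random

import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Surrogate LLM whose activations drive the energy model (hypothetical choice).
surr_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
surrogate = AutoModel.from_pretrained("Qwen/Qwen2-1.5B").to(device).eval()

# Masked language model used to propose natural-sounding token replacements.
mlm_tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").to(device).eval()

# Two-layer MLP energy head over mean-pooled surrogate activations.
energy_head = torch.nn.Sequential(
    torch.nn.Linear(surrogate.config.hidden_size, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
).to(device).eval()

@torch.no_grad()
def energy(prompt: str) -> float:
    """Lower energy means the activations look closer to successful attacks."""
    ids = surr_tok(prompt, return_tensors="pt").to(device)
    hidden = surrogate(**ids).last_hidden_state        # (1, seq_len, hidden)
    return energy_head(hidden.mean(dim=1)).item()      # mean-pool, then score

@torch.no_grad()
def propose(prompt: str) -> str:
    """Replace one random word with a masked-LM suggestion to keep fluency."""
    words = prompt.split()
    i = random.randrange(len(words))
    masked = " ".join(words[:i] + [mlm_tok.mask_token] + words[i + 1:])
    enc = mlm_tok(masked, return_tensors="pt").to(device)
    mask_pos = (enc.input_ids == mlm_tok.mask_token_id).nonzero()[0, 1]
    top_ids = mlm(**enc).logits[0, mask_pos].topk(5).indices.tolist()
    words[i] = mlm_tok.decode([random.choice(top_ids)]).strip()
    return " ".join(words)

def mcmc_refine(seed: str, steps: int = 200, temperature: float = 1.0) -> str:
    """Metropolis acceptance on the energy gap: keep lower-energy prompts,
    occasionally accept worse ones to escape local minima."""
    current, e_cur = seed, energy(seed)
    for _ in range(steps):
        candidate = propose(current)
        e_cand = energy(candidate)
        if e_cand < e_cur or random.random() < math.exp((e_cur - e_cand) / temperature):
            current, e_cur = candidate, e_cand
    return current
```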

Key Findings

  • High cross-model transferability: the method achieves 49.6 per cent attack success across five mainstream LLMs, outperforming human-crafted prompts by about 34.6 per cent.
  • Robust unseen-task performance: the approach maintains 36.6 per cent ASR on tasks not encountered during training.
  • Activation-driven interpretability: a strong negative correlation exists between energy and ASR, with a Pearson correlation of about -0.979, indicating that lower energy aligns with more effective prompts; PCA of activations shows successful attacks move toward specific activation directions.
  • Prompts remain natural: perplexity metrics show generated prompts maintain naturalness similar to manual prompts and other adversarial methods, avoiding overtly unnatural text that could trigger detection.
  • Hierarchical prompt structure and dataset design: the attack combines 85 prefixes, 35 infixes and 85 suffixes into thousands of templates, enabling diverse, realistic prompts while preserving attack strength.
  • Defence implications: the results suggest the need for robust prompt validation, monitoring of activation patterns and cross-model testing to improve resilience against transferable prompt-based attacks.
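The interpretability checks above can be reproduced in outline with a short script: compute the Pearson correlation between per-prompt energy and success, and project activation vectors onto their top principal components to look for clustering. The activations, energies and success labels are assumed to come from your own red-team logging; the arrays below are placeholders.

```python
# Sketch of the interpretability checks: energy/success correlation and a
# 2-D PCA of activation vectors. Data here is synthetic placeholder input.
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def pca_2d(activations: np.ndarray) -> np.ndarray:
    """Project (n_samples, hidden_dim) activations onto their top-2 principal axes."""
    centred = activations - activations.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

# Placeholder data: 200 prompts, 1024-dim activations, an energy score and a
# success flag per prompt (in practice these come from logged red-team runs).
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 1024))
energies = rng.normal(size=200)
successes = (energies < 0).astype(float)   # toy stand-in for measured outcomes

print("energy/success correlation:", pearson(energies, successes))
coords = pca_2d(acts)
print("mean PCA position of successful attacks:", coords[successes == 1].mean(axis=0))
```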

Limitations

The work notes a trade-off between prompt naturalness and attack strength, and naturalness is not explored beyond current acceptability thresholds. The method does not address defences based on text classifiers, and performance varies with task and model choice. Further work could explore sharper constraints that maintain naturalness while increasing attack potency, and evaluate against more defensive mechanisms.
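As a starting point for the kind of text-level defence the paper does not evaluate in depth, a simple perplexity-based naturalness filter looks like the sketch below, assuming a GPT-2 scorer. The threshold is illustrative, and the finding that the adversarial prompts stay natural suggests such a filter is weak on its own.

```python
# Sketch of a perplexity-based naturalness check using GPT-2 as the scorer.
# Threshold and model choice are illustrative assumptions, not from the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Token-level perplexity of text under GPT-2."""
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss          # mean negative log-likelihood
    return float(torch.exp(loss))

def looks_natural(prompt: str, threshold: float = 200.0) -> bool:
    """Flag prompts whose perplexity exceeds an (illustrative) threshold.
    Natural-looking adversarial prompts will often pass this check."""
    return perplexity(prompt) <= threshold
```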

Why It Matters

The study demonstrates a transferable black box direct prompt injection attack that leverages activation signals from a surrogate model and token level MCMC optimisation, illustrating a practical vulnerability that transfers across models and tasks. It highlights blind spots in defenses that focus on model internals, underscoring the need for robust input validation, monitoring of activation patterns, and adversarial prompt testing across models to improve resilience. The findings have security and trust implications for LLM powered services used in critical applications, reinforcing the importance of comprehensive defensive frameworks and regular red teaming across model families.

