SlotGCG finds jailbreak sweet spots inside LLM prompts
Attacks
Suffix-only jailbreaks were never going to be the final word. SlotGCG puts numbers on a simple, awkward truth for Large Language Models (LLMs): where you insert the poison matters. In a pilot across 50 harmful prompts, the most effective insertion point was never the suffix. If your defences stare only at the end of the prompt, you are looking in the wrong place.
The attack treats every gap in the prompt as a slot. It runs a single probing pass to compute a Vulnerable Slot Score (VSS) for each slot, using attention weights from after-chat template tokens in the upper-half transformer layers. The scores are passed through a softmax, a token budget is allocated to the highest-scoring slots, and then a Greedy Coordinate Gradient (GCG) optimiser works only on those positions. The slot search is attack-agnostic and adds roughly 200 ms of preprocessing, so this is not some lab-only stunt.
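To make the scoring step concrete, here is a minimal sketch of how a per-slot score could be read off an attention tensor. It is not the authors' code. It assumes one probe token per slot, a toy nested-list tensor indexed as attn[layer][head][query][key], and a plain mean over layers, heads and template queries; the function name and the aggregation rule are illustrative assumptions.

```python
from statistics import fmean


def vulnerable_slot_scores(attn, template_positions, probe_positions):
    """Mean attention from template-token queries to each probe-token key.

    attn[layer][head][query][key] holds attention weights (toy nested lists).
    Only the upper half of the layers is used, as the article describes.
    Averaging (rather than summing or weighting) is an assumption.
    """
    upper_layers = attn[len(attn) // 2:]
    scores = []
    for key in probe_positions:
        weights = [
            head[query][key]
            for layer in upper_layers
            for head in layer
            for query in template_positions
        ]
        scores.append(fmean(weights))
    return scores


# Toy tensor: 2 layers, 1 head, 4 token positions (causal rows sum to 1).
# Positions 0 and 2 stand in for probe tokens, position 3 for a template token.
attn = [
    [[[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
      [0.2, 0.3, 0.5, 0.0], [0.25, 0.25, 0.25, 0.25]]],  # lower layer, ignored
    [[[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
      [0.2, 0.3, 0.5, 0.0], [0.1, 0.2, 0.6, 0.1]]],      # upper layer, used
]
scores = vulnerable_slot_scores(attn, template_positions=[3], probe_positions=[0, 2])
```

In this toy case the second probe position draws more template attention than the first, so it would be ranked as the more vulnerable slot.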
Across Llama-2-7B/13B, Llama-3.1-8B, Mistral-7B, Vicuna-7B and Qwen-2.5 on the AdvBench 50-prompt set, the authors report about a 14% higher attack success rate than GCG-style suffix baselines. Under their tested defences, SlotGCG holds up with a 42% higher success rate. It also converges faster, with average iterations on Llama-2-7B dropping from 138.11 to 40.50, and up to 10x speedups in some settings. The claimed mechanism is straightforward: tokens in high-score slots perturb the output distribution more than tail tokens, so the optimiser has more leverage.
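For scale, the reported Llama-2-7B iteration counts can be turned into a speedup figure with a couple of lines of arithmetic. The variable names are illustrative; the two numbers come from the paragraph above.

```python
# Average optimisation iterations on Llama-2-7B, as reported above.
baseline_iters = 138.11
slot_iters = 40.50

speedup = baseline_iters / slot_iters        # how many times fewer iterations
reduction = 1 - slot_iters / baseline_iters  # fraction of iterations saved

print(f"{speedup:.2f}x speedup, {reduction:.1%} fewer iterations")
# prints "3.41x speedup, 70.7% fewer iterations"
```

So the average case on that model is roughly a 3.4x speedup, which sits well below the "up to 10x" best case.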
There is a universal variant as well. By aggregating per-behaviour slot scores into a single profile, the authors improve zero-shot transfer to a 388-behaviour set and to several other open and closed models. If you want a cheap, reusable jailbreak token soup, this gets you closer.
Now the caveats. The evaluation leans on 50 behaviours and specific public models and defences, so generality is not proven. The Vulnerable Slot Score rests on a particular attention heuristic (upper-half layers, after-chat template tokens) that could shift with different chat formats. Success is judged with a GPT-4 gate plus manual checks, so the absolute numbers carry noise. None of that kills the core point.
Position is a first-class attack surface. SlotGCG turns that into a 200 ms preflight and a measurable win. Not a grand theory of jailbreaking, but a sharp operational tweak that makes suffix-only sanitisation look naive. The code is out, the idea is simple, and it lowers the bar for practical slot-aware attacks.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies how the position of inserted adversarial tokens inside a prompt affects jailbreak attacks on large language models. Prior optimisation-based attacks have typically appended tokens at the prompt suffix, assuming that is the most influential position. If inner positions are more vulnerable, suffix-only attacks and many defences may overlook substantial risks.
Approach
The authors formalise insertion slots as the L+1 candidate positions before, between and after the L tokens of a prompt. They introduce the Vulnerable Slot Score (VSS), computed from attention weights from after-chat template tokens in the upper-half transformer layers, to quantify positional vulnerability. SlotGCG inserts lightweight probing tokens into every slot, computes VSS for each slot, converts the scores to a softmax distribution (temperature typically 8), allocates a fixed budget of adversarial tokens across slots according to that distribution, and then runs a GCG-based discrete gradient optimisation restricted to the chosen slots. The slot-discovery step is a single attack-agnostic inference pass that adds roughly 200 ms of preprocessing. Experiments use the AdvBench set of 50 harmful prompts and multiple open models including Llama-2-7B/13B, Llama-3.1-8B, Mistral-7B, Vicuna-7B and Qwen-2.5. SlotGCG is also extended to a Universal SlotGCG that aggregates per-behaviour VSS into a universal profile for transfer across behaviours and models.
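The slot bookkeeping and budget allocation described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: dividing scores by the temperature is the conventional softmax form and is assumed here, largest-remainder rounding is a choice made for this sketch, and the function names, example scores and "<adv>" placeholder are hypothetical.

```python
import math


def softmax(scores, temperature=8.0):
    """Temperature softmax; dividing by the temperature is an assumption."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def allocate_budget(probs, budget):
    """Split an integer token budget across slots in proportion to probs.

    Uses largest-remainder rounding so the counts always sum to budget.
    The rounding rule is an assumption of this sketch.
    """
    raw = [p * budget for p in probs]
    counts = [math.floor(r) for r in raw]
    leftover = budget - sum(counts)
    by_remainder = sorted(
        range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def insert_into_slots(tokens, counts, filler="<adv>"):
    """Slot i sits before tokens[i]; the final slot follows the last token."""
    assert len(counts) == len(tokens) + 1  # L tokens give L + 1 slots
    out = []
    for i, count in enumerate(counts):
        out.extend([filler] * count)
        if i < len(tokens):
            out.append(tokens[i])
    return out


# Illustrative scores for a 3-token prompt (4 slots); not values from the paper.
slot_scores = [0.0, 8.0, 16.0, 0.0]
probs = softmax(slot_scores, temperature=8.0)
counts = allocate_budget(probs, budget=20)
layout = insert_into_slots(["a", "b", "c"], [1, 0, 2, 0])
```

A higher temperature flattens the distribution and spreads the budget over more slots; a lower one concentrates it on the top-scoring slot.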
Key Findings
- Positional vulnerability varies per prompt; the optimal insertion slot is frequently not the suffix. In the 50-prompt pilot study the slot producing minimum adversarial loss was never the suffix.
- VSS correlates negatively with adversarial loss and positively with attack success. The VSS values measured at initialisation remain stable through optimisation, with correlations typically between 0.4 and 0.9.
- SlotGCG raises the average attack success rate (ASR) by about 14% over GCG-based baselines and achieves 42% higher ASR under tested defences. It converges faster, often requiring far fewer optimisation iterations; for example, iterations on Llama-2-7B fell from 138.11 to 40.50 on average, with up to a tenfold speedup in some settings.
- Distributing tokens to high-VSS slots yields larger perturbations to the model output distribution than appending tokens at the suffix, explaining improved effectiveness and robustness.
- Slot-aware universal optimisation transfers better across behaviours and models than universal suffix baselines; Universal SlotGCG trained on 50 behaviours improved zero-shot transfer to a 388-behaviour set and to several closed- and open-source models.
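The correlation finding in the list above is the kind of claim a reader can check on their own measurements. A minimal sketch of that check, using a hand-written Pearson coefficient and made-up per-slot numbers (the values below are illustrative only and are not data from the paper):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-slot values for illustration only; not data from the paper.
toy_vss = [0.05, 0.10, 0.30, 0.40, 0.15]
toy_loss = [2.9, 2.6, 1.4, 1.1, 2.2]

r = pearson(toy_vss, toy_loss)
print(f"correlation between VSS and adversarial loss: {r:.2f}")
```

A clearly negative r on real measurements would be consistent with the reported pattern that higher-VSS slots reach lower adversarial loss.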
Limitations
The evaluation focuses on a 50-behaviour training set and particular public models and defences; generality to other prompts and closed models is tested, but not exhaustively. VSS relies on attention from upper-half layers and after-chat template tokens, an empirical design choice that may not hold for all architectures or templates. Defence results are influenced by a GPT-4 judge used for early stopping and a subsequent manual verification stage, which complicates ASR measurement. Exhaustive per-slot optimisation is costly, though SlotGCG avoids it with a single probing step.
Implications
Attackers can increase jailbreak success by inserting adversarial tokens at inner prompt positions rather than only at the suffix. They can also spread tokens across multiple vulnerable slots to survive input filtering or token erasure, and produce faster, more transferable jailbreaks. Because SlotGCG is attack-agnostic, fast to probe and open-source, it lowers the barrier to practical slot-aware attacks. This suggests defenders should adopt slot-aware sanitisation and red-teaming that scans internal prompt positions, not just the end.
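On the defensive side, the difference between a tail-only check and a scan of every position can be shown with a toy filter. This sketch is not from the paper: the "odd-looking token" test is a deliberately crude, hypothetical stand-in for a real detector (it would also flag ordinary hyphenated words), and the window width, threshold and example prompt are arbitrary.

```python
def looks_odd(token):
    """Crude stand-in for a real detector: token contains non-alphanumerics."""
    core = token.strip(".,?!;:'\"")
    return bool(core) and not core.isalnum()


def window_scores(tokens, width):
    """Fraction of odd-looking tokens in each sliding window of `width` tokens."""
    flags = [looks_odd(t) for t in tokens]
    last_start = max(len(tokens) - width, 0)
    return [sum(flags[i:i + width]) / width for i in range(last_start + 1)]


def flag_tail_only(tokens, width, threshold):
    """Suffix-style check: inspect only the final window."""
    return window_scores(tokens, width)[-1] >= threshold


def flag_any_position(tokens, width, threshold):
    """Slot-aware check: inspect every window in the prompt."""
    return max(window_scores(tokens, width)) >= threshold


# A benign-looking request with a symbol-heavy run placed mid-prompt.
prompt = "Please summarise }]!( ~~x> {{$ the attached report for me today"
tokens = prompt.split()

tail_hit = flag_tail_only(tokens, width=3, threshold=0.67)
scan_hit = flag_any_position(tokens, width=3, threshold=0.67)
print(tail_hit, scan_hit)  # prints "False True"
```

The tail-only check passes this prompt because its last three tokens are ordinary words, while the full scan flags the run in the middle. A production filter would need a far better per-token signal, but the scanning pattern is the point.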