SlotGCG finds jailbreak sweet spots inside LLM prompts

Published: Fri, Jun 05, 2026 • By Clara Nyx
SlotGCG shows Large Language Models are more vulnerable when adversarial tokens are placed inside prompts, not just at the end. A quick probe scores vulnerable positions, then a GCG-style optimiser targets them, lifting jailbreak success by about 14% and holding up 42% better against tested defences, with faster convergence across several open models.

Suffix-only jailbreaks were never going to be the final word. SlotGCG puts numbers on a simple, awkward truth for Large Language Models (LLMs): where you insert the poison matters. In a pilot across 50 harmful prompts, the most effective insertion point was never the suffix. If your defences stare only at the end of the prompt, you are looking in the wrong place.

The attack treats every gap in the prompt as a slot. It runs a single probing pass to compute a Vulnerable Slot Score using attention weights from after-chat template tokens in the upper-half transformer layers. Those scores are softmaxed, a token budget is allocated to the highest-scoring slots, and then a Greedy Coordinate Gradient (GCG) optimiser works only on those positions. The slot search is attack-agnostic and adds roughly 200 ms of preprocessing, so this is not some lab-only stunt.

Across Llama-2-7B/13B, Llama-3.1-8B, Mistral-7B, Vicuna-7B and Qwen-2.5 on the AdvBench 50-prompt set, the authors report an attack success rate about 14% higher than GCG-style suffix baselines. Under the tested defences, SlotGCG's success rate is 42% higher than the baselines'. It also converges faster: average iterations on Llama-2-7B drop from 138.11 to 40.50, with up to 10x speedups in some settings. The claimed mechanism is straightforward: tokens in high-score slots perturb the output distribution more than tail tokens, so the optimiser has more leverage.
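For scale, the Llama-2-7B iteration averages quoted above can be turned into a speedup factor with a few lines of arithmetic. This is only a back-of-the-envelope check on the two reported averages; the variable names are illustrative, and the "up to 10x" figure refers to other settings that this calculation does not cover.

```python
# Average optimisation iterations on Llama-2-7B, as quoted in the article.
iters_before = 138.11
iters_after = 40.50

# Ratio of the two averages, and the relative reduction in iterations.
speedup = iters_before / iters_after
reduction_pct = (1 - iters_after / iters_before) * 100

print(f"speedup: {speedup:.2f}x")                 # speedup: 3.41x
print(f"fewer iterations: {reduction_pct:.1f}%")  # fewer iterations: 70.7%
```

So the headline average is roughly a 3.4x reduction, well short of the best-case 10x.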

There is a universal variant as well. By aggregating per-behaviour slot scores into a profile, they improve zero-shot transfer to a 388-behaviour set and to several other open and closed models. If you want a cheap, reusable jailbreak token soup, this gets you closer.

Now the caveats. The evaluation leans on 50 behaviours and specific public models and defences, so generality is not proven. The Vulnerable Slot Score depends on a particular attention heuristic (upper-half layers and after-chat template tokens) that could behave differently under other chat formats. Success is judged with a GPT-4 gate plus manual checks, so the absolute numbers carry noise. None of that kills the core point.

Position is a first-class attack surface. SlotGCG turns that into a 200 ms preflight and a measurable win. Not a grand theory of jailbreaking, but a sharp operational tweak that makes suffix-only sanitisation look naive. The code is out, the idea is simple, and it lowers the bar for practical slot-aware attacks.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks

Authors: Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee, and Woojin Lee
As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting adversarial tokens to the end of prompts. However, GCG restricts adversarial tokens to a fixed insertion point (typically the prompt suffix), leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate slots, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking is highly related to the selection of the slots. Based on these findings, we introduce the Vulnerable Slot Score (VSS) to quantify the positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position-search mechanism that is attack-agnostic and can be plugged into any optimization-based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves 14% higher Attack Success Rates (ASR) over GCG-based attacks, converges faster, and shows superior robustness against defense methods with 42% higher ASR than baseline approaches. Our implementation is available at https://github.com/youai058/SlotGCG

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies how the position of inserted adversarial tokens inside a prompt affects jailbreak attacks on large language models. Prior optimisation-based attacks have typically appended tokens at the prompt suffix, assuming that is the most influential position. If inner positions are more vulnerable, suffix-only attacks and many defences may overlook substantial risks.

Approach

The authors formalise insertion slots as the L+1 candidate positions before, between and after tokens in a prompt. They introduce the Vulnerable Slot Score (VSS), computed from attention weights from after-chat template tokens in the upper-half transformer layers, to quantify positional vulnerability. SlotGCG inserts lightweight probing tokens into every slot, computes VSS for each slot, converts scores to a softmax distribution (temperature typically 8), allocates a fixed budget of adversarial tokens across slots according to that distribution, and then runs a GCG-based discrete gradient optimisation targeted to the chosen slots. The slot-discovery step is an attack-agnostic single inference that adds roughly 200 ms of preprocessing. Experiments use the AdvBench set of 50 harmful prompts and multiple open models including Llama-2-7B/13B, Llama-3.1-8B, Mistral-7B, Vicuna-7B and Qwen-2.5. SlotGCG is also extended to a Universal SlotGCG that aggregates per-behaviour VSS into a universal profile for transfer across behaviours and models.
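The slot bookkeeping described above (L+1 positions, a tempered softmax, a fixed token budget) can be sketched with toy numbers. This is a minimal illustration, not the authors' implementation: the scores are made up rather than computed from attention, the softmax is assumed to divide scores by the temperature, and the largest-remainder rounding used to hand out whole tokens is an assumption, since the article does not say how fractional allocations are resolved.

```python
import math

def enumerate_slots(tokens):
    """Return the L+1 insertion positions (before, between, after) for L tokens."""
    return list(range(len(tokens) + 1))

def softmax(scores, temperature):
    """Tempered softmax; dividing by the temperature is an assumed convention."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def allocate_budget(probs, budget):
    """Split `budget` whole tokens across slots in proportion to `probs`.

    Uses largest-remainder rounding (an assumption) so the counts sum to `budget`.
    """
    raw = [p * budget for p in probs]
    counts = [math.floor(r) for r in raw]
    leftover = max(0, budget - sum(counts))
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts

# Toy prompt of 5 tokens, so 6 slots. The scores are invented for illustration.
tokens = ["a", "b", "c", "d", "e"]
slots = enumerate_slots(tokens)
toy_scores = [0.5, 4.0, 12.0, 1.0, 20.0, 2.0]
probs = softmax(toy_scores, temperature=8.0)
counts = allocate_budget(probs, budget=20)

for slot, prob, count in zip(slots, probs, counts):
    print(f"slot {slot}: p={prob:.3f}, tokens={count}")
```

A higher temperature flattens the distribution and spreads the budget over more slots; a lower one concentrates it on the top-scoring positions.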

Key Findings

  • Positional vulnerability varies per prompt; the optimal insertion slot is frequently not the suffix. In the 50-prompt pilot study the slot producing minimum adversarial loss was never the suffix.
  • VSS correlates negatively with adversarial loss and positively with attack success; initial VSS values remain stable through optimisation with correlations typically between 0.4 and 0.9.
  • SlotGCG raises average attack success rates by about 14% over GCG-based baselines and achieves 42% higher ASR under tested defences. It converges faster, often requiring far fewer optimisation iterations; for example, iterations on Llama-2-7B fell from 138.11 to 40.50 on average, with up to 10 times speedup in some settings.
  • Distributing tokens to high-VSS slots yields larger perturbations to the model output distribution than appending tokens at the suffix, explaining improved effectiveness and robustness.
  • Slot-aware universal optimisation transfers better across behaviours and models than universal suffix baselines; Universal SlotGCG trained on 50 behaviours improved zero-shot transfer to a 388-behaviour set and to several closed- and open-source models.

Limitations

The evaluation focuses on a 50-behaviour training set and particular public models and defences, so transfer to other prompts and closed models is tested but not exhaustively. VSS relies on attention from upper-half layers and after-chat template tokens, an empirical design choice that may not hold for all architectures or templates. Defence results are influenced by a GPT-4 judge used for early stopping and a subsequent manual verification stage, complicating ASR measurement. Full exhaustive per-slot optimisation is costly, though SlotGCG avoids that with a single probing step.

Implications

Attackers can increase jailbreak success by inserting adversarial tokens at inner prompt positions rather than only at the suffix, allocate tokens across multiple vulnerable slots to survive input filtering or token erasure, and produce faster, more transferable jailbreaks. Because SlotGCG is attack-agnostic, fast to probe and open-source, it lowers the barrier to practical slot-aware attacks and suggests defenders should adopt slot-aware sanitisation and red-teaming that scans internal prompt positions, not just the end.
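On the defensive side, the difference between tail-only and slot-aware scanning can be shown with a small sliding-window check. This is a rough sketch under loud assumptions: the `odd_token` heuristic, the window width and the threshold are all hypothetical stand-ins, and real adversarial tokens need not look like gibberish, so a production filter would need a stronger scorer. The point is only the scanning pattern, which covers every window instead of the last one.

```python
import string

def odd_token(token):
    """Toy heuristic (an assumption): a token is odd if it is not purely alphabetic
    once surrounding punctuation is stripped."""
    core = token.strip(string.punctuation)
    return not core.isalpha()

def window_score(window):
    """Fraction of odd tokens in a window; 0.0 for an empty window."""
    if not window:
        return 0.0
    return sum(odd_token(t) for t in window) / len(window)

def flag_tail_only(tokens, width, threshold):
    """Suffix-only check: inspect just the last `width` tokens."""
    return window_score(tokens[-width:]) >= threshold

def flag_anywhere(tokens, width, threshold):
    """Slot-aware check: return the start index of every suspicious window."""
    last_start = max(0, len(tokens) - width)
    return [
        start
        for start in range(last_start + 1)
        if window_score(tokens[start:start + width]) >= threshold
    ]

# Benign toy prompt with a run of odd-looking tokens in the middle.
prompt = "Please summarise the x7# q]]9 {{z2 9!k quarterly report for the finance team"
tokens = prompt.split()

tail_hit = flag_tail_only(tokens, width=4, threshold=0.75)
inner_hits = flag_anywhere(tokens, width=4, threshold=0.75)
print("tail-only flagged:", tail_hit)
print("windows flagged anywhere:", inner_hits)
```

The tail-only check sees a clean ending and passes the prompt, while the full scan flags the windows that overlap the inserted run.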

