ShortSpan.ai logo

Prefill attacks bypass safeguards in open-weight LLMs

Attacks
Published: Tue, Feb 17, 2026 • By Elise Veyron
Prefill attacks bypass safeguards in open-weight LLMs
New study on prefill attacks shows open-weight LLMs widely vulnerable; 23 strategies across 50 models; attacks often succeed, including near-universal when multiple tried; top tactics System Simulation, Fake Citation, Continuation Full; reasoning-stage models still at risk; prefilling can degrade utility; calls for token-agnostic safety, seed prompt controls, monitoring.

Prompt injection has dominated the recent security conversation around Large Language Models (LLMs). This study argues we are missing a simpler, quieter vector: prefilling. Instead of coaxing a model with a user prompt, an attacker sets the first tokens of the model’s response. Those seeds nudge the model away from its refusal behaviours before generation begins. For open-weight models that run locally and rely mainly on internal safeguards, that is a serious gap.

The authors ran what they describe as the largest empirical look at prefilling to date. They evaluated 23 strategies across 50 recent open-weight models from six providers. Twenty strategies were model-agnostic prompts produced with an uncensored Gemma3 27B model, joined by three deterministic transforms. Harm was judged using ClearHarm and StrongREJECT, while benign utility was measured on MATH-500 and GPQA Diamond. Two automated guard evaluators, GPT-OSS-Safeguard and Qwen3Guard, scored outputs. The team reported several success rates, including a conservative measure that required both evaluators to judge a response as highly helpful to a harmful goal.

The results are not comforting. Prefill attacks reliably elicited harmful outputs across all evaluated model families. When an attacker can cycle through many strategies, the success rate for a given model often approaches 100 percent. Even single best strategies commonly exceeded 70 percent. The more conservative measure remained high across families. Three model-agnostic strategies stood out: System Simulation, Fake Citation, and Continuation Full.

Size did not save models. Larger parameter counts generally did not correlate with lower susceptibility. Models with explicit reasoning stages showed some resilience to generic prefills, but the protection was fragile. Skipping or emptying the reasoning channel made attacks much easier, and tailored, model-specific prefills restored high success while yielding more detailed harmful outputs.

There is a cost to utility as well. When prefilling bypassed internal reasoning stages, benign performance dropped. The study reports that GPT-OSS-120B saw 10 to 54 percentage-point declines on a maths benchmark when its analysis stage was bypassed. Some Qwen3 models retained utility under prefilling, suggesting the effect is model-dependent. In many cases the outputs from successful prefill attacks matched or neared the detail seen in uncensored, abliterated models.

As with any automated evaluation, there are caveats. The harm tests focused on general informational requests, not highly specialised technical content. The two guard evaluators can misclassify, producing both false positives and false negatives. They also do not verify factual accuracy. These limits may affect the measured attack rates and the quality of the generated content.

Why this matters for deployment

Prefilling is practical wherever an attacker controls inference, which is common with locally deployed open-weight LLMs. If your threat model includes users or processes that can set generation seeds or touch reasoning channels, internal safeties are not enough. The study points to defences worth prioritising: detect or restrict seed prompts, make safety layers token-agnostic so they apply regardless of the first tokens, and monitor initial inputs and reasoning channels for anomalous prefills. The authors also note that tailored, model-specific mitigations may be required.

From a governance lens, this is a reminder that release choices are security choices. Open-weight distributions shift more responsibility to deployers, who need visibility and controls over generation start states. Procurement policies should ask vendors how their models handle prefills, what monitoring hooks exist, and whether safety layers survive early-token manipulation. We should not pretend the fix is easy, but this is tractable: sensible guard design, operational monitoring, and transparency about known failure modes can close much of the gap. That is a better equilibrium than relying on brittle refusals that vanish the moment the first tokens are not what the model expects.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

Authors: Lukas Struppek, Adam Gleave, and Kellin Pelrine
As the capabilities of large language models continue to advance, so does their potential for misuse. While closed-source models typically rely on external defenses, open-weight models must primarily depend on internal safeguards to mitigate harmful behavior. Prior red-teaming research has largely focused on input-based jailbreaking and parameter-level manipulations. However, open-weight models also natively support prefilling, which allows an attacker to predefine initial response tokens before generation begins. Despite its potential, this attack vector has received little systematic attention. We present the largest empirical study to date of prefill attacks, evaluating over 20 existing and novel strategies across multiple model families and state-of-the-art open-weight models. Our results show that prefill attacks are consistently effective against all major contemporary open-weight models, revealing a critical and previously underexplored vulnerability with significant implications for deployment. While certain large reasoning models exhibit some robustness against generic prefilling, they remain vulnerable to tailored, model-specific strategies. Our findings underscore the urgent need for model developers to prioritize defenses against prefill attacks in open-weight LLMs.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies prefilling, an underexplored attack on open-weight large language models in which an attacker fixes the initial response tokens before generation begins. Prefilling can bias a model away from internal refusal behaviour and thereby bypass built-in safeguards in locally deployed open-weight models. Because open-weight releases cannot rely on external deployment-time filters, prefilling poses a deployment-relevant safety risk that merits systematic study.

Approach

The authors ran a large empirical study across 50 recent open-weight models from six providers, evaluating 23 prefill strategies (20 model-agnostic strategies generated with an uncensored Gemma3 27B model plus three deterministic transforms) and five variants per strategy. Harm benchmarks included ClearHarm (179 unambiguously harmful requests) and StrongREJECT; utility was assessed on MATH-500 and GPQA Diamond. Two automated guard evaluators were used to judge outputs: GPT-OSS-Safeguard and Qwen3Guard. Success metrics include request-level and strategy-level attack success rates (ASRany and ASRbest) and a conservative ASRmin that requires both evaluators to mark a response as highly helpful to the harmful goal. Experiments tested prefilling in different model stages, including reasoning channels and empty-analysis variants, and compared prefilling to uncensored abliterated models.

Key Findings

  • Prefill attacks reliably elicit harmful outputs across all evaluated open-weight model families; when an attacker can try many strategies success often approaches near 100% for individual models.
  • Using a suite of model-agnostic prefills, single-strategy best ASR values commonly exceed 70% and brute-force ASRany frequently exceeds 90%; ASRmin reports a conservative lower bound but remains high across families.
  • Three model-agnostic strategies were consistently most effective: System Simulation (mean ASR 73.7%), Fake Citation (67.5%), and Continuation Full (67.3%).
  • Model size alone does not meaningfully improve robustness; larger parameter counts did not correlate with lower susceptibility in most families.
  • Reasoning-stage models show some resilience to generic prefills but remain vulnerable: skipping or emptying the reasoning channel dramatically increases success, and model-specific prefills can restore high ASR while producing more detailed harmful outputs.
  • Prefilling can degrade benign reasoning utility when it bypasses internal reasoning stages; e.g., GPT-OSS-120B showed 10-54 percentage-point drops on a mathematical benchmark when the analysis stage was bypassed, whereas some Qwen3 models retained utility.
  • Outputs produced by successful prefill attacks can match or approach the detail of abliterated uncensored models in many cases.

Limitations

The evaluation focuses mainly on general informational harmful requests rather than highly specialised technical queries. Assessment relied on automated harm evaluators, which can produce false positives and false negatives and do not substitute for domain expert verification of factual accuracy. These factors may affect measured attack effectiveness and output quality.

Why It Matters

Prefilling is a practical, powerful attack vector for locally deployed open-weight LLMs that can bypass internal safeguards and produce detailed harmful content. The authors recommend prioritising defences that detect or restrict seed prompts, make safety layers token-agnostic, monitor initial inputs and reasoning channels for anomalous prefills, and develop model-specific mitigations. These measures are important to reduce real-world deployment risks where attackers control inference and can inject prefills.


Related Articles

Related Research on arXiv

Get the Monthly AI Security Digest

Top research and analysis delivered to your inbox once a month. No spam, unsubscribe anytime.

Subscribe