NeuroStrike exposes neuron-level alignment failures in LLMs
Attacks
Researchers describe NeuroStrike, an attack that targets the internal neurons LLMs use to detect and block harmful inputs. The result is simple and worrying: disabling a tiny fraction of neurons often flips a model from safe to unsafe, with practical implications for organisations that rely on LLMs for content generation.
Scope matters. The team tested more than 20 open-weight models and several proprietary targets, and found consistently high attack success rates. In white-box settings, pruning less than 0.6 per cent of neurons in targeted layers produces a 76.9 per cent average success rate. The attack extends to multimodal models, with a reported 100 per cent success rate on unsafe image inputs. Even in black-box scenarios, adversarial prompts trained on surrogate models transfer well, yielding a 63.7 per cent average success rate across the evaluated commercial targets.
How the attack works
The attack exploits so-called safety neurons: sparse, specialised units whose activations correlate with unsafe content. In white-box cases, attackers identify and prune those neurons at inference. In black-box settings, they train prompt generators on surrogate models to avoid triggering those neurons while still eliciting unsafe outputs. The study notes that safety neurons transfer across architectures and across fine-tuned or distilled variants, raising the risk of offline profiling.
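To make the white-box step concrete, the sketch below shows how identified safety neurons could be silenced at inference by zeroing the weight rows that produce their activations. It assumes a LLaMA-style gated MLP with separate gate and up projections and already-known neuron indices; the layer layout and indices are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of the white-box pruning step, assuming the safety-neuron
# indices have already been identified (e.g. with a linear probe) and that the
# model uses a LLaMA-style gated MLP. Names and indices are illustrative.
import torch
import torch.nn as nn

@torch.no_grad()
def prune_safety_neurons(gate_proj: nn.Linear, up_proj: nn.Linear,
                         neuron_indices: list[int]) -> None:
    """Silence selected intermediate neurons by zeroing the weight rows that
    produce their activations; with both projections zeroed, the neuron's
    contribution to the MLP output is always zero."""
    for proj in (gate_proj, up_proj):
        proj.weight[neuron_indices, :] = 0.0
        if proj.bias is not None:
            proj.bias[neuron_indices] = 0.0

# Toy usage: a stand-in MLP block with a 64-unit intermediate dimension.
gate = nn.Linear(32, 64, bias=False)
up = nn.Linear(32, 64, bias=False)
prune_safety_neurons(gate, up, neuron_indices=[3, 17, 42])  # hypothetical indices
```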
That leverage creates clear risks. Attackers can bypass surface defences such as perplexity filters, prompt perturbation and some layer editing techniques. Because core language utility is largely preserved after pruning, the trade-off between capability and safety is small enough to make misuse realistic at scale.
Mitigations and practical trade-offs
Defences that rely on a single set of detectors look fragile. The paper points to distributed or dynamic safety layers, runtime monitoring for abnormal activations or gate edits, and adversarial red teaming that includes surrogate-based profiling. Organisations should also treat multimodal inputs as a higher risk vector.
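As one illustration of the runtime-monitoring idea, the hedged sketch below registers a forward hook on a projection layer and alerts when profiled safety neurons stay unusually quiet on inputs an upstream filter has already flagged as risky. The neuron indices, baseline and threshold are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of runtime activation monitoring. Assumes a profiled set of
# candidate safety-neuron indices and a baseline activation level measured on
# known-unsafe red-team prompts; indices, baseline and threshold are illustrative.
import torch
import torch.nn as nn

def attach_safety_monitor(proj_layer: nn.Module, neuron_indices: list[int],
                          baseline: float, alert_fn):
    """Register a forward hook that alerts when the profiled neurons stay
    unusually quiet on inputs an upstream filter flagged as risky, which may
    indicate pruning, gate edits or an evasive prompt."""
    def hook(module, inputs, output):
        level = output[..., neuron_indices].abs().mean().item()
        if level < 0.1 * baseline:  # heuristic threshold, tune per model
            alert_fn(f"safety-neuron activity {level:.4f} well below baseline {baseline:.4f}")
    return proj_layer.register_forward_hook(hook)

# Toy usage on a stand-in projection layer.
gate_proj = nn.Linear(32, 64, bias=False)
handle = attach_safety_monitor(gate_proj, neuron_indices=[3, 17, 42],
                               baseline=0.8, alert_fn=print)
_ = gate_proj(torch.zeros(1, 32))  # all-zero input keeps the neurons quiet and triggers the alert
handle.remove()
```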
Limitations include the attack's dependence on the observed sparsity of safety neurons, and real-world applicability can vary with architecture and operational constraints. The work does not exhaust every possible defence, but it does raise the bar for what counts as robust alignment.
What to do this quarter: run adversarial profiling against your deployed models using open-weight surrogates, add runtime integrity checks for layer and gate edits, and expand red-team tests to include multimodal prompts. Later: invest in distributed safety architectures, cross-model auditing, and procurement rules that require resistance to neuron-level manipulation. These steps are practical and avoid performative compliance while addressing a concrete technical weakness.
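A minimal sketch of the integrity-check step, assuming a PyTorch model with LLaMA-style gate_proj and up_proj module names (an assumption, not a universal convention): fingerprint the monitored projection weights at deployment, then re-check the digests on a schedule to detect layer or gate edits.

```python
# Minimal sketch of a runtime integrity check for layer and gate edits:
# fingerprint the projection weights at deployment and re-check periodically.
# Module-name matching on "gate_proj"/"up_proj" is an assumption.
import hashlib
import torch.nn as nn

def weight_fingerprints(model: nn.Module) -> dict[str, str]:
    """Return a SHA-256 digest for each monitored projection weight tensor."""
    digests = {}
    for name, param in model.named_parameters():
        if "gate_proj" in name or "up_proj" in name:
            data = param.detach().cpu().float().contiguous().numpy().tobytes()
            digests[name] = hashlib.sha256(data).hexdigest()
    return digests

def check_integrity(model: nn.Module, reference: dict[str, str]) -> list[str]:
    """Return the names of monitored tensors whose digest has drifted since deployment."""
    current = weight_fingerprints(model)
    return [name for name, digest in reference.items() if current.get(name) != digest]
```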
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
NeuroStrike: Neuron-Level Attacks on Aligned LLMs
🔍 ShortSpan Analysis of the Paper
Problem
Safety alignment aims to curb harmful content in LLMs using techniques like reinforcement learning from human feedback or supervised fine-tuning, but these methods are fragile and can be bypassed by carefully crafted adversarial prompts. This paper argues that alignment relies on sparse, specialised safety neurons responsible for detecting and suppressing harmful inputs, creating a vulnerability that can be exploited across a wide range of models, including black-box systems.
Approach
NeuroStrike is a general, lightweight attack framework that targets safety neurons. In white-box settings it identifies safety neurons by analysing feedforward activations and prunes them during inference to disable safety mechanisms. In black-box settings it introduces an LLM profiling attack that exploits safety-neuron transferability: adversarial jailbreak prompt generators are trained on open-weight surrogate models with Group Relative Policy Optimisation to maximise jailbreak success while minimising safety-neuron activations, and the resulting prompts are then deployed against target models. The framework is evaluated on over twenty open-weight LLMs and four multimodal LLMs, with assessments of transferability across base, fine-tuned, distilled and black-box variants.
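The reward shaping behind the profiling attack can be illustrated conceptually: reward prompts that elicit unsafe output from the surrogate while keeping its profiled safety neurons quiet. The sketch below is a simplified stand-in, not the paper's GRPO training loop; the judge score, the activation measurement and the weighting are placeholders.

```python
# Conceptual sketch of the reward signal for training a jailbreak prompt
# generator against an open-weight surrogate. The judge score, activation
# capture and penalty weight are placeholders for illustration only.
from dataclasses import dataclass

@dataclass
class SurrogateFeedback:
    jailbreak_score: float    # judge's estimate that the response is unsafe, in [0, 1]
    safety_activation: float  # mean activation of profiled safety neurons, in [0, 1]

def profiling_reward(feedback: SurrogateFeedback, penalty_weight: float = 0.5) -> float:
    """Reward prompts that elicit unsafe output while keeping the surrogate's
    profiled safety neurons quiet, so the prompt transfers to guarded targets."""
    return feedback.jailbreak_score - penalty_weight * feedback.safety_activation

# Example: a prompt that jailbreaks the surrogate but lights up its safety
# neurons scores lower than one that slips past them entirely.
print(profiling_reward(SurrogateFeedback(0.9, 0.8)))  # 0.5
print(profiling_reward(SurrogateFeedback(0.8, 0.1)))  # 0.75
```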
Key Findings
- Removing less than 0.6 per cent of neurons in targeted layers yields an average attack success rate of 76.9 per cent using vanilla malicious prompts across open-weight LLMs.
- NeuroStrike generalises to four multimodal LLMs with a 100 per cent attack success rate on unsafe image inputs, including text-to-image and NSFW image scenarios.
- Safety neurons transfer effectively across architectures, increasing attack success rate to 78.5 per cent on eleven fine-tuned models and 77.7 per cent on five distilled models, indicating robustness of the vulnerability across variants.
- The black-box LLM profiling attack achieves an average attack success rate of 63.7 per cent across five black-box models, including the Google Gemini family, demonstrating effective offline transfer of jailbreak prompts.
- Safety neuron activations are nearly linearly separable, enabling linear probes to identify critical neurons (see the sketch after this list); pruning the gate sublayer typically yields a higher attack success rate than pruning the up projection, though combining both increases effectiveness.
- Defences such as perplexity filtering, prompt perturbations and layer-specific editing can be bypassed by NeuroStrike, underscoring the need for multi-layer, distributed safety mechanisms and runtime monitoring.
- Utility on core language tasks is largely preserved after pruning; some reasoning-heavy benchmarks show modest degradation, indicating a trade-off between safety enforcement and general capability.
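A minimal sketch of the linear-probe idea behind that separability finding, using synthetic activations in place of ones captured from a real model: fit a linear classifier on intermediate activations from safe versus unsafe prompts and rank neurons by coefficient magnitude. The shifted indices are hypothetical.

```python
# Minimal sketch of a linear probe for candidate safety neurons. Activations
# here are synthetic; in practice they would be captured from a surrogate model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_safe, n_unsafe, n_neurons = 200, 200, 64

# Synthetic activations: a handful of "safety neurons" fire more on unsafe prompts.
safe = rng.normal(0.0, 1.0, size=(n_safe, n_neurons))
unsafe = rng.normal(0.0, 1.0, size=(n_unsafe, n_neurons))
unsafe[:, [3, 17, 42]] += 2.0  # hypothetical safety-neuron indices

X = np.vstack([safe, unsafe])
y = np.concatenate([np.zeros(n_safe), np.ones(n_unsafe)])

probe = LogisticRegression(max_iter=1000).fit(X, y)
ranked = np.argsort(-np.abs(probe.coef_[0]))  # most discriminative neurons first
print("candidate safety neurons:", ranked[:5])
```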
Limitations
The work relies on the existence and transferability of sparse safety neurons and reports results for a broad but finite set of open-weight models, distilled models and black-box targets. Defence assessments are not exhaustive, and outcomes may vary with architecture. Practical deployment requires access to internal activations in white-box settings or offline profiling in black-box scenarios, which may limit applicability in some real-world deployments.
Why It Matters
The study reveals a fundamental weakness in safety alignment: a small, sparse set of safety neurons can drive safety decisions with high leverage and transferability across models, making them vulnerable to pruning and to offline profiling attacks in black-box environments. This has practical security implications for scalable content generation and potential misuse including disinformation or illicit activities, particularly when multimodal inputs are involved. The findings motivate defence approaches that go beyond sparse detectors, including distributed or dynamic safety mechanisms, cross-model auditing, runtime monitoring for neuron-level manipulation, red-team-style adversarial testing, and architectural changes to reduce single-point safety reliance.