NeuroStrike exposes neuron-level alignment failures in LLMs
Attacks
Researchers describe NeuroStrike, an attack that targets the internal neurons LLMs use to detect and block harmful inputs. The result is simple and worrying: disabling a tiny fraction of neurons often flips a model from safe to unsafe, with practical implications for organisations that rely on LLMs for content generation.
Scope matters. The team tested more than 20 open-weight models and several proprietary targets, and found consistently high attack success rates. In white-box settings, pruning less than 0.6 per cent of neurons in targeted layers produces a 76.9 per cent average success rate. The attack extends to multimodal models, with a reported 100 per cent success rate on unsafe image inputs. Even in black-box scenarios, adversarial prompts trained on surrogate models transfer well, yielding a 63.7 per cent average success rate across the evaluated commercial targets.
How the attack works
The attack exploits so-called safety neurons: sparse, specialised units whose activations correlate with unsafe content. In white-box cases, attackers identify and prune those neurons at inference. In black-box settings, they train prompt generators on surrogate models to avoid triggering those neurons while still eliciting unsafe outputs. The study notes that safety neurons transfer across architectures and across fine-tuned or distilled variants, raising the risk of offline profiling.
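To make the white-box step concrete, the sketch below shows how identified safety neurons could be silenced at inference by zeroing the weight rows that produce their activations. It assumes a LLaMA-style gated MLP with separate gate and up projections and already-known neuron indices; the layer layout and indices are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of the white-box pruning step, assuming the safety-neuron
# indices have already been identified (e.g. with a linear probe) and that the
# model uses a LLaMA-style gated MLP. Names and indices are illustrative.
import torch
import torch.nn as nn

@torch.no_grad()
def prune_safety_neurons(gate_proj: nn.Linear, up_proj: nn.Linear,
                         neuron_indices: list[int]) -> None:
    """Silence selected intermediate neurons by zeroing the weight rows that
    produce their activations; with both projections zeroed, the neuron's
    contribution to the MLP output is always zero."""
    for proj in (gate_proj, up_proj):
        proj.weight[neuron_indices, :] = 0.0
        if proj.bias is not None:
            proj.bias[neuron_indices] = 0.0

# Toy usage: a stand-in MLP block with a 64-unit intermediate dimension.
gate = nn.Linear(32, 64, bias=False)
up = nn.Linear(32, 64, bias=False)
prune_safety_neurons(gate, up, neuron_indices=[3, 17, 42])  # hypothetical indices
```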
That leverage creates clear risks. Attackers can bypass surface defences such as perplexity filters, prompt perturbation and some layer editing techniques. Because core language utility is largely preserved after pruning, the trade-off between capability and safety is small enough to make misuse realistic at scale.
Mitigations and practical trade-offs
Defences that rely on a single set of detectors look fragile. The paper points to distributed or dynamic safety layers, runtime monitoring for abnormal activations or gate edits, and adversarial red teaming that includes surrogate-based profiling. Organisations should also treat multimodal inputs as a higher risk vector.
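As one illustration of the runtime-monitoring idea, the hedged sketch below registers a forward hook on a projection layer and alerts when profiled safety neurons stay unusually quiet on inputs an upstream filter has already flagged as risky. The neuron indices, baseline and threshold are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of runtime activation monitoring. Assumes a profiled set of
# candidate safety-neuron indices and a baseline activation level measured on
# known-unsafe red-team prompts; indices, baseline and threshold are illustrative.
import torch
import torch.nn as nn

def attach_safety_monitor(proj_layer: nn.Module, neuron_indices: list[int],
                          baseline: float, alert_fn):
    """Register a forward hook that alerts when the profiled neurons stay
    unusually quiet on inputs an upstream filter flagged as risky, which may
    indicate pruning, gate edits or an evasive prompt."""
    def hook(module, inputs, output):
        level = output[..., neuron_indices].abs().mean().item()
        if level < 0.1 * baseline:  # heuristic threshold, tune per model
            alert_fn(f"safety-neuron activity {level:.4f} well below baseline {baseline:.4f}")
    return proj_layer.register_forward_hook(hook)

# Toy usage on a stand-in projection layer.
gate_proj = nn.Linear(32, 64, bias=False)
handle = attach_safety_monitor(gate_proj, neuron_indices=[3, 17, 42],
                               baseline=0.8, alert_fn=print)
_ = gate_proj(torch.zeros(1, 32))  # all-zero input keeps the neurons quiet and triggers the alert
handle.remove()
```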
Limitations include the attack's dependence on the observed sparsity of safety neurons, and real-world applicability can vary with architecture and operational constraints. The work does not exhaust every possible defence, but it does raise the bar for what counts as robust alignment.
What to do this quarter: run adversarial profiling against your deployed models using open-weight surrogates, add runtime integrity checks for layer and gate edits, and expand red-team tests to include multimodal prompts. Later: invest in distributed safety architectures, cross-model auditing, and procurement rules that require resistance to neuron-level manipulation. These steps are practical and avoid performative compliance while addressing a concrete technical weakness.
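A minimal sketch of the integrity-check step, assuming a PyTorch model with LLaMA-style gate_proj and up_proj module names (an assumption, not a universal convention): fingerprint the monitored projection weights at deployment, then re-check the digests on a schedule to detect layer or gate edits.

```python
# Minimal sketch of a runtime integrity check for layer and gate edits:
# fingerprint the projection weights at deployment and re-check periodically.
# Module-name matching on "gate_proj"/"up_proj" is an assumption.
import hashlib
import torch.nn as nn

def weight_fingerprints(model: nn.Module) -> dict[str, str]:
    """Return a SHA-256 digest for each monitored projection weight tensor."""
    digests = {}
    for name, param in model.named_parameters():
        if "gate_proj" in name or "up_proj" in name:
            data = param.detach().cpu().float().contiguous().numpy().tobytes()
            digests[name] = hashlib.sha256(data).hexdigest()
    return digests

def check_integrity(model: nn.Module, reference: dict[str, str]) -> list[str]:
    """Return the names of monitored tensors whose digest has drifted since deployment."""
    current = weight_fingerprints(model)
    return [name for name, digest in reference.items() if current.get(name) != digest]
```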
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
NeuroStrike: Neuron-Level Attacks on Aligned LLMs
🔍 ShortSpan Analysis of the Paper
Problem
Safety alignment aims to curb harmful content in LLMs using techniques like reinforcement learning from human feedback or supervised fine-tuning, but these methods are fragile and can be bypassed by carefully crafted adversarial prompts. This paper argues that alignment relies on sparse, specialised safety neurons responsible for detecting and suppressing harmful inputs, creating a vulnerability that can be exploited across a wide range of models, including black-box systems.
Approach
NeuroStrike is a general, lightweight attack framework that targets safety neurons. In white-box settings it identifies safety neurons by analysing feedforward activations and prunes them during inference to disable safety mechanisms. In black-box settings it introduces an LLM profiling attack that exploits safety-neuron transferability: adversarial jailbreak prompt generators are trained on open-weight surrogate models with Group Relative Policy Optimisation to maximise jailbreak success while minimising safety-neuron activations, and the resulting prompts are then deployed against target models. The framework is evaluated on over twenty open-weight LLMs and four multimodal LLMs, with assessments of transferability across base, fine-tuned, distilled and black-box variants.
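The reward shaping behind the profiling attack can be illustrated conceptually: reward prompts that elicit unsafe output from the surrogate while keeping its profiled safety neurons quiet. The sketch below is a simplified stand-in, not the paper's GRPO training loop; the judge score, the activation measurement and the weighting are placeholders.

```python
# Conceptual sketch of the reward signal for training a jailbreak prompt
# generator against an open-weight surrogate. The judge score, activation
# capture and penalty weight are placeholders for illustration only.
from dataclasses import dataclass

@dataclass
class SurrogateFeedback:
    jailbreak_score: float    # judge's estimate that the response is unsafe, in [0, 1]
    safety_activation: float  # mean activation of profiled safety neurons, in [0, 1]

def profiling_reward(feedback: SurrogateFeedback, penalty_weight: float = 0.5) -> float:
    """Reward prompts that elicit unsafe output while keeping the surrogate's
    profiled safety neurons quiet, so the prompt transfers to guarded targets."""
    return feedback.jailbreak_score - penalty_weight * feedback.safety_activation

# Example: a prompt that jailbreaks the surrogate but lights up its safety
# neurons scores lower than one that slips past them entirely.
print(profiling_reward(SurrogateFeedback(0.9, 0.8)))  # 0.5
print(profiling_reward(SurrogateFeedback(0.8, 0.1)))  # 0.75
```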
Key Findings
- Removing less than 0.6 per cent of neurons in targeted layers yields an average attack success rate of 76.9 per cent using vanilla malicious prompts across open-weight LLMs.
- NeuroStrike generalises to four multimodal LLMs with a 100 per cent attack success rate on unsafe image inputs, including text-to-image and NSFW image scenarios.
- Safety neurons transfer effectively across architectures, increasing attack success rate to 78.5 per cent on eleven fine-tuned models and 77.7 per cent on five distilled models, indicating robustness of the vulnerability across variants.
- The black-box LLM profiling attack achieves an average attack success rate of 63.7 per cent across five black-box models, including the Google Gemini family, demonstrating effective offline transfer of jailbreak prompts.
- Safety neuron activations are nearly linearly separable, enabling linear probes to identify critical neurons (see the sketch after this list); pruning the gate sublayer typically yields a higher attack success rate than pruning the up projection, though combining both increases effectiveness.
- Defences such as perplexity filtering, prompt perturbations and layer-specific editing can be bypassed by NeuroStrike, underscoring the need for multi-layer, distributed safety mechanisms and runtime monitoring.
- Utility on core language tasks is largely preserved after pruning; some reasoning-heavy benchmarks show modest degradation, indicating a trade-off between safety enforcement and general capability.
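A minimal sketch of the linear-probe idea behind that separability finding, using synthetic activations in place of ones captured from a real model: fit a linear classifier on intermediate activations from safe versus unsafe prompts and rank neurons by coefficient magnitude. The shifted indices are hypothetical.

```python
# Minimal sketch of a linear probe for candidate safety neurons. Activations
# here are synthetic; in practice they would be captured from a surrogate model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_safe, n_unsafe, n_neurons = 200, 200, 64

# Synthetic activations: a handful of "safety neurons" fire more on unsafe prompts.
safe = rng.normal(0.0, 1.0, size=(n_safe, n_neurons))
unsafe = rng.normal(0.0, 1.0, size=(n_unsafe, n_neurons))
unsafe[:, [3, 17, 42]] += 2.0  # hypothetical safety-neuron indices

X = np.vstack([safe, unsafe])
y = np.concatenate([np.zeros(n_safe), np.ones(n_unsafe)])

probe = LogisticRegression(max_iter=1000).fit(X, y)
ranked = np.argsort(-np.abs(probe.coef_[0]))  # most discriminative neurons first
print("candidate safety neurons:", ranked[:5])
```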
Limitations
The work relies on the existence and transferability of sparse safety neurons and reports results for a broad but finite set of open-weight models, distilled models and black-box targets. Defence assessments are not exhaustive, and outcomes may vary with architecture. Practical deployment requires access to internal activations in white-box settings or offline profiling in black-box scenarios, which may limit applicability in some real-world deployments.
Why It Matters
The study reveals a fundamental weakness in safety alignment: a small, sparse set of safety neurons can drive safety decisions with high leverage and transferability across models, making them vulnerable to pruning and to offline profiling attacks in black-box environments. This has practical security implications for scalable content generation and potential misuse including disinformation or illicit activities, particularly when multimodal inputs are involved. The findings motivate defence approaches that go beyond sparse detectors, including distributed or dynamic safety mechanisms, cross-model auditing, runtime monitoring for neuron-level manipulation, red-team-style adversarial testing, and architectural changes to reduce single-point safety reliance.