Researchers Hijack LLM Safety Neurons to Jailbreak Models

Defenses

Published: Tue, Sep 02, 2025 • By Natalie Kestrel

Researchers Hijack LLM Safety Neurons to Jailbreak Models

New research shows a small set of safety neurons inside LLMs largely decide whether models refuse harmful prompts. Attackers can flip those activations to produce jailbreaks with over 97 percent success. The study introduces SafeTuning, a targeted fine-tune that hardens those neurons but flags performance trade offs and dual use risks.

Call it a scalpel for alignment. The paper demonstrates that a tiny subset of 'safety neurons' inside transformer models acts like an internal gate deciding refuse or comply. With clever calibration, researchers flip those neurons and drive models to obey harmful prompts with near perfect success.

That matters because it changes where defenders must look. Too many vendors treat safety as an external filter or policy layer. This work shows the barrier lives inside the model and can be nudged without standard gradient hacking or massive parameter changes. The reported attack succeeds after changing about 0.3 percent of parameters and hits above 97 percent success rates in tests.

On the hopeful side, the authors present SafeTuning, a focused fine tune that strengthens safety neurons and reduces jailbreak success more effectively than baseline methods. But there is no free lunch. Overzealous neuron isolation degrades general performance and can hurt model usefulness. And, as always, an interpretability technique that reveals a lever is also a potential playbook for attackers.

Takeaway for operators: protect internal gates, not just API wrappers. Treat neuron-level findings as a new attack surface during red teams and threat modeling.

Actionable checks teams can run now:

Run jailbreak red team prompts while monitoring middle MLP layer activations to spot correlated spikes.
Compare model outputs before and after small targeted parameter perturbations to assess fragility.
Apply SafeTuning-style fine tuning in a staging environment and measure utility drift on core tasks.
Limit public release of model internals and neuron maps that could aid adversaries.
Include neuron-level tests in CI for alignment regressions.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

Authors: Chongwen Zhao and Kaizhu Huang

Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation, a technique known as "Jailbreak." While some studies have achieved defenses against jailbreak attacks by modifying output distributions or detecting harmful content, the exact rationale still remains elusive. In this work, we present a novel neuron-level interpretability method that focuses on the role of safety-related knowledge neurons. Unlike existing approaches, our method projects the model's internal representation into a more consistent and interpretable vocabulary space. We then show that adjusting the activation of safety-related neurons can effectively control the model's behavior with a mean ASR higher than 97%. Building on this insight, we propose SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons to improve model robustness against jailbreaks. SafeTuning consistently reduces attack success rates across multiple LLMs and outperforms all four baseline defenses. These findings offer a new perspective on understanding and defending against jailbreak attacks.

🔍 ShortSpan Analysis of the Paper

Problem

Large Language Models (LLMs) are increasingly used across applications, yet jailbreaks can induce harmful outputs such as illicit instructions or disinformation. While prior defences modify output distributions or detect harmful content, the underlying rationale for jailbreak resilience remains unclear. This work introduces a neuron level interpretability approach that focuses on safety related knowledge neurons within the MLP layers of transformer based LLMs. By projecting the model's internal representations into a stable vocabulary space, the method seeks to reveal how safety gating operates and how it can be manipulated. The authors show that adjusting the activation of safety related neurons can strongly influence behaviour, achieving a mean ASR higher than 97 per cent, and thereby demonstrating a causal role for these neurons in alignment decisions. Building on this insight they propose SafeTuning, a fine tuning strategy that reinforces safety critical neurons to improve robustness against jailbreaks. SafeTuning consistently reduces attack success rates across multiple LLMs and outperforms four baseline defences. The work offers a new perspective on understanding and defending against jailbreak attacks.

Approach

The authors present a neuron level interpretability method that maps safety knowledge neurons into a human readable vocabulary table, enabling analysis of how benign and harmful prompts activate distinct words. They identify safety critical activations from the MLP layers, define safety knowledge neuron sets and a refined set, and project activation patterns from harmful and benign prompts into a shared vocabulary space. They observe a dual pattern in which harmful prompts trigger rejection knowledge and benign prompts trigger conformity knowledge, with coherent tokens emerging as early as the tenth layer and persisting thereafter. They then introduce a new attack that calibrates the activation of safety neurons to shift responses toward conformity or rejection without using model gradients, preserving standard inference time. Experiments on two models and two subtasks demonstrate near perfect attack success with only about 0.3 per cent of parameters changed, supporting a causal role for safety related neurons. To defend against jailbreaks they propose SafeTuning, which fine tunes the safety related activations using a dataset of harmful prompts paired with refused responses, aiming to harden the model against jailbreak prompts.

Key Findings

Safety related knowledge neurons govern whether the model refuses harmful prompts or yields to them, and deliberate activation changes can produce jailbreak like behaviour with very high success.
The interpretability method maps activations into a vocabulary space, revealing coherent conformity and rejection tokens in middle to late layers and enabling a causal link between neuron activation and output decisions.
A targeted calibration attack can move safety neurons toward conformity or rejection, achieving near perfect attack success with only a small parameter change and without gradient based optimisation.
SafeTuning, a fine tuning strategy that strengthens safety critical neurons, substantially reduces attack success rates across LLMs and outperforms baseline defence methods across tasks.
The attack and defence findings indicate that only a small subset of parameters control the safety barrier, while isolating safety neurons alone can degrade overall model performance unless carefully managed.

Limitations

The study notes that isolating and calibrating safety related neurons can significantly degrade model performance and generalisation, and that tuning too many safety neurons can reduce effectiveness due to overlap with other functional neurons. Experiments are conducted on two open models and a limited set of tasks; results may vary with different architectures or data. The work focuses on internal mechanisms rather than external prompts and acknowledges potential practical constraints when deploying SafeTuning in production.

Why It Matters

The findings emphasise that internal safety gates within LLMs are key drivers of refusals and redirections to safety, and that adjusting their activation can alter responses even in well aligned models. Defences based on internal safety knowledge offer a complementary approach to external filters, with SafeTuning providing a practical method to reduce jailbreak success while preserving usability. The results carry practical implications for developers to guard internal safety gates and consider trade offs with safety calibrations and model usefulness, while also raising concerns about potential exploitation by attackers who could use neuron level insights to bypass or weaken safeguards. Societal impact includes reduced harmful content generation and disinformation, but ongoing research and governance are needed to manage dual use and evolving jailbreak techniques.

Attribution Original paper on arXiv