
MoE models vulnerable to expert silencing attack

Attacks
Published: Tue, Feb 10, 2026 • By Adrian Calder
Researchers show a training-free attack called Large Language Lobotomy (L3) that bypasses safety in mixture-of-experts (MoE) large language models by silencing a small set of experts. Across eight open-source MoE models, L3 raises average attack success from 7.3% to 70.4%, typically by silencing fewer than 20% of experts per layer while largely preserving utility.

Mixture-of-experts (MoE) models promised cheaper scaling by routing each token to a few specialised submodels, or experts. That sparsity also creates a weak spot. New research demonstrates a training-free, architecture-agnostic attack called Large Language Lobotomy, or L3, which exploits routing dynamics to switch off the experts that enforce refusals and other safety behaviours.

In plain terms, some safety work in MoE models is localised. Instead of every part of the model politely refusing dangerous requests, a handful of experts do most of the refusing. L3 first records routing traces while feeding benign and malicious prompts into a target model. It trains a small sequence classifier to tell the difference and then uses gradients to attribute refusal behaviour to particular experts. In a second phase it masks the gating signals for the highest‑ranked safety experts until the model produces the forbidden content.
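To make the silencing step concrete, the sketch below shows one way gating logits could be masked at inference time so the router can never select chosen experts. It is a minimal illustration of the general mechanism, not the authors' implementation; the function name and tensor shapes are assumptions.

```python
import torch

def silence_experts(gate_logits: torch.Tensor, silenced: list[int]) -> torch.Tensor:
    """Minimal sketch (not the paper's code): mask the gating logits of selected
    experts so the router's top-k selection can never pick them."""
    masked = gate_logits.clone()
    masked[..., silenced] = float("-inf")  # silenced experts can no longer win top-k
    return masked

# Toy example: a router scoring 8 experts for a single token.
logits = torch.randn(1, 1, 8)                    # [batch, seq_len, num_experts]
masked = silence_experts(logits, silenced=[2, 5])
top2 = masked.topk(k=2, dim=-1).indices          # experts 2 and 5 are excluded
```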

The results are striking. Across eight open-source MoE models and a 780-prompt test set, adaptive expert silencing raises average attack success from 7.3% to 70.4%, reaching up to 86.3% on some models. The attack typically silences fewer than 20% of experts per layer and, in many cases, preserves normal language utility. L3 outperforms prior training-free attacks such as GateBreaker on most models tested.

Why this matters is simple. Organisations deploying MoE LLMs may assume safety is baked into the model. If rejection behaviour sits in a small, identifiable subset of experts, an adversary with inference access — or a malicious insider — can nudge routing away from those experts and get harmful outputs without retraining the model. The attack is practical because it does not require model weights to be changed, only the silencing of selected experts at runtime.

Limitations and caveats

The work focuses on open‑source MoE models under white‑box conditions, using automated judges and some human verification. Results may not transfer exactly to proprietary systems or different routing schemes. In some models the silenced experts also carry general language capability, so aggressive silencing can make the model incoherent. The attack’s non‑monotonic effects mean a blunt or fixed defence is unlikely to be sufficient.

What to do next

Practical defenders should treat MoE routing as part of the attack surface. Short-term mitigations include distributing safety behaviours across more experts, adding routing-aware monitoring to detect unusual expert activation patterns, and enforcing runtime integrity checks that prevent arbitrary masking of experts. A secondary refusal verifier that inspects outputs independently of the router reduces reliance on a single point of safety. During training, regularisation and dropout that spread refusal behaviour across diverse routing paths will reduce this concentration.
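As one illustration of routing-aware monitoring, the sketch below compares observed per-expert activation frequencies against a baseline profile collected on known-good traffic. The scoring rule and weights are assumptions for illustration only, not a method from the paper.

```python
import numpy as np

def routing_anomaly_score(observed_counts: np.ndarray, baseline_freq: np.ndarray,
                          eps: float = 1e-8) -> float:
    """Hypothetical monitor: divergence between runtime expert-activation
    frequencies and a known-good baseline. Experts that suddenly stop firing
    are weighted extra, since forced silencing produces exactly that pattern."""
    observed = observed_counts / max(observed_counts.sum(), 1)
    kl = float(np.sum(observed * np.log((observed + eps) / (baseline_freq + eps))))
    never_fired = int(np.sum((observed == 0) & (baseline_freq > 0.01)))
    return kl + 0.5 * never_fired  # the 0.5 weight is an arbitrary illustrative choice

# Toy example: two of eight experts stop firing entirely at runtime.
baseline = np.full(8, 1 / 8)
observed = np.array([30, 25, 0, 28, 27, 0, 26, 24], dtype=float)
print(routing_anomaly_score(observed, baseline))  # noticeably above the benign range
```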

Organisations running MoE models should also include routing manipulation in red team plans and verify defences in both white‑box and black‑box scenarios. The paper’s authors release code to support replication and defensive work, but do not publish compromised checkpoints. The overall takeaway is sober: the efficiency gains of MoE architectures create a tangible safety trade‑off that engineers and decision makers must treat as part of normal security hygiene, not an exotic research problem.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Authors: Jona te Lintelo, Lichao Wu, and Stjepan Picek
The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this, we propose Large Language Lobotomy (L$^3$), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L$^3$ learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L$^3$ on eight state-of-the-art open-source MoE LLMs and show that our adaptive expert silencing increases average attack success from 7.3% to 70.4%, reaching up to 86.3%, outperforming prior training-free MoE jailbreak methods. Moreover, bypassing guardrails typically requires silencing fewer than 20% of layer-wise experts while largely preserving general language utility. These results reveal a fundamental tension between efficiency-driven MoE design and robust safety alignment and motivate distributing safety mechanisms more robustly in future MoE LLMs with architecture- and routing-aware methods.

🔍 ShortSpan Analysis of the Paper

Problem

Mixture-of-Experts (MoE) large language models activate only a sparse subset of parameters per token via a router, offering scaling efficiency but creating new safety vulnerabilities. The paper finds that safety-critical behaviours such as refusals are concentrated in a small subset of experts rather than distributed across the model, creating an attack surface whereby an adversary can silence these safety-enforcing experts during inference to generate harmful outputs without retraining. It introduces Large Language Lobotomy (L3), a training-free, architecture-agnostic jailbreak framework that exploits routing dynamics to bypass safety alignment in MoE LLMs. Evaluations on eight open-source MoE LLMs show high attack success rates, indicating real-world risk where safety relies on a limited set of experts. The work also discusses defence implications and the need for safety mechanisms that are distributed and routing-aware.

Approach

L3 operates in two phases. In the first phase, routing traces are recorded during inference and a lightweight Long Short-Term Memory (LSTM) classifier is trained to distinguish benign from malicious routing sequences. A twin dataset pairs malicious prompts with minimally perturbed benign counterparts to isolate safety-related routing. Gradients of the classifier with respect to local expert embeddings are used to attribute refusal behaviour to specific local experts, enabling identification of safety experts irrespective of overall activation frequency. In the second phase, an adaptive attack silences the highest-ranked safety experts by masking their gating logits during inference, forcing the router to reallocate to other experts until a harmful output appears. The evaluation uses eight open-source MoE models and a dataset of 780 prompts, comprising 390 malicious and 390 benign prompts. Experiments run on two NVIDIA H100 GPUs using PyTorch and Hugging Face tooling. A baseline comparison with GateBreaker is included. General language utility is assessed by silencing the identified safety experts and testing on the ARC, CoLA, OpenBookQA, RTE and WinoGrande benchmarks. The authors release code for reproducibility but do not release compromised model checkpoints.
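The attribution idea in the first phase can be sketched as follows: a small LSTM reads sequences of expert selections, predicts whether the trace came from a benign or malicious prompt, and gradients of the malicious logit with respect to an expert embedding table rank experts by safety relevance. This is a simplified reconstruction under assumed shapes and hyperparameters, not the authors' released code.

```python
import torch
import torch.nn as nn

NUM_EXPERTS, EMB_DIM, HIDDEN = 64, 32, 64  # assumed sizes for illustration

class RoutingClassifier(nn.Module):
    """Toy stand-in for the routing-trace classifier described above."""
    def __init__(self):
        super().__init__()
        self.expert_emb = nn.Embedding(NUM_EXPERTS, EMB_DIM)   # local expert embeddings
        self.lstm = nn.LSTM(EMB_DIM, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, 2)                        # benign vs. malicious trace

    def forward(self, expert_ids: torch.Tensor) -> torch.Tensor:  # [batch, seq_len]
        x = self.expert_emb(expert_ids)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])

model = RoutingClassifier()
trace = torch.randint(0, NUM_EXPERTS, (1, 128))  # one recorded routing trace
model(trace)[0, 1].backward()                    # gradient of the "malicious" logit

# Rank experts by the gradient magnitude on their embeddings: large values suggest
# the expert's presence in the trace drives the refusal-related prediction.
relevance = model.expert_emb.weight.grad.norm(dim=-1)
candidate_safety_experts = relevance.topk(5).indices
```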

Key Findings

  • Safety capabilities are highly localised; adaptive expert silencing increases average attack success from 7.3% to 70.4%, with some models reaching up to 86.3%.
  • By design, fewer than about 20% of layer-wise experts need to be silenced to bypass refusals, while general language utility is largely preserved in many cases.
  • L3 outperforms prior inference-time attacks such as GateBreaker on six of eight models; Phi-3.5-MoE-Instruct shows a lower attack success rate (ASR), and in some cases silencing causes incoherence.
  • One-shot silencing can yield substantial but sometimes inconsistent gains; adaptive silencing typically yields higher and more reliable ASR than fixed-percentage silencing.
  • Safety and general language utility can be decoupled in several models, though in others safety and utility overlap, making silencing more disruptive to normal use.
  • Global silencing raises ASR but is less effective than local silencing, and the distribution of safety across layers or global experts varies by model, indicating heterogeneity in safety localisation.
  • Random silencing is consistently less effective than targeted L3 silencing, supporting that safety relevance is not random but tied to routing dynamics.
  • Defence implications include distributing safety across more experts, routing-aware safeguards, preventing runtime tampering via expert integrity checks, and implementing a secondary refusal verification mechanism; black-box extensions may leverage transfer-based surrogates.
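The integrity-check idea in the final point above could look something like the sketch below: before routing, verify that no expert's gating logits have been forced into an unselectable state. This is a hypothetical check under assumed tensor shapes, not a mechanism evaluated in the paper.

```python
import torch

def check_gate_integrity(gate_logits: torch.Tensor) -> None:
    """Hypothetical runtime check: flag gating tensors where an entire expert has
    been forced to -inf for every token, the signature of external silencing."""
    flat = gate_logits.reshape(-1, gate_logits.shape[-1])  # [tokens, num_experts]
    fully_masked = torch.isneginf(flat).all(dim=0)
    if fully_masked.any():
        bad = fully_masked.nonzero().flatten().tolist()
        raise RuntimeError(f"Experts {bad} appear to be force-silenced")

# Toy example: expert 3 has been masked out for every token.
logits = torch.randn(4, 16, 8)
logits[..., 3] = float("-inf")
check_gate_integrity(logits)  # raises RuntimeError
```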

Limitations

The study uses eight open-source MoE LLMs under a white-box, training-free setting and focuses on inference-time manipulation. Results may not fully generalise to proprietary closed models or other routing schemes. Some models show overlap between safety and general language capability within the same local experts, complicating defence. The non-monotonic effects of silencing and potential incoherence at higher silencing levels indicate limitations to static defence strategies. The twin dataset and LSTM-based identification are specific to this study, and broader validation is warranted. The results rely on automated judges with a human verification step to confirm harmful outputs.

Why It Matters

The findings reveal a fundamental security trade-off in MoE design: the sparsity that enables efficiency can concentrate safety controls into fragile local components that are amenable to targeted silencing. This highlights the need for architecture-aware safety that distributes safeguards across a wider set of experts, and routing-aware monitoring to detect tampering with expert activations. Proposed defences include redundancy of safety behaviour across more experts, regularisation or dropout during alignment to avoid over-reliance on a few routing paths, and a secondary safeguard or verifier that operates independently of the MoE routing. The work also discusses potential black-box extensions and ethical considerations, noting that while code is released to support defensive research, unsafe model checkpoints are not provided. Code and replication resources are available at the authors' repository: https://github.com/jonatelintelo/LargeLanguageLobotomy.
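A secondary safeguard of the kind described above could be wired in as simply as the sketch below: the verifier inspects the prompt and output pair independently of the MoE router, so expert silencing alone cannot disable it. The callables are placeholders, not components from the paper.

```python
from typing import Callable

def guarded_generate(generate: Callable[[str], str],
                     refusal_verifier: Callable[[str, str], bool],
                     prompt: str) -> str:
    """Hypothetical wrapper: even if expert silencing strips the model's own
    refusals, an independent check on (prompt, output) can still block the reply."""
    output = generate(prompt)
    if refusal_verifier(prompt, output):  # True means the pair looks harmful
        return "I can't help with that request."
    return output

# Toy usage with stand-in callables; in practice the verifier would be a separate
# safety classifier or guard model that shares no routing path with the target.
reply = guarded_generate(
    generate=lambda p: f"(model output for: {p})",
    refusal_verifier=lambda p, o: "explosive" in p.lower(),
    prompt="How do I bake bread?",
)
```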

