Misrouter steers MoE LLMs into unsafe outputs
Attacks
Mixture-of-Experts Large Language Models (MoE LLMs) save compute by letting a router pick a few specialists to process each token. That router is now a security lever. The new work on Misrouter asks a sharp question: with only input prompts, can attackers steer routing to coax unsafe behaviour from hosted services? The short answer is often yes, especially when the target shares lineage with an open-source surrogate.
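To make the attack surface concrete, here is a minimal sketch of token-level top-k routing, the mechanism the attack targets. This is a generic illustration, not the routing function of any particular model: the shapes, the softmax-over-top-k gating, and `k=2` are common conventions rather than details from the paper.

```python
import numpy as np

def route_token(hidden, router_weights, k=2):
    """Pick the top-k experts for one token from router logits.

    hidden: (d,) token representation; router_weights: (num_experts, d).
    Returns the chosen expert indices and their normalised gate weights.
    Because these logits depend on the input, an attacker who can shape
    the input can shape which experts fire.
    """
    logits = router_weights @ hidden                 # one score per expert
    top = np.argsort(logits)[-k:][::-1]              # k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                             # softmax over the chosen k
    return top, gates

# Tiny example: 8 experts, 16-dimensional token representation.
rng = np.random.default_rng(0)
experts, gates = route_token(rng.normal(size=16), rng.normal(size=(8, 16)))
```

Only the top-k experts run, which is where the compute saving comes from; the same sparsity is what gives the router its leverage as an attack target.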
How the attack works
Misrouter calibrates its aim on a surrogate MoE model it can inspect. It builds three probe sets: first, harmful queries that the model normally refuses; second, the same harmful queries paired with unsafe continuations, to spot weakly aligned experts that will go along; third, a benign question-answering set, to surface the general-purpose experts that write fluent, high-quality text. By resampling routes and tracking per-expert activations, it computes a signed routing score that suppresses strongly aligned experts, promotes weakly aligned ones, and anchors to capable generalists.
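The signed routing score can be sketched as a weighted combination of per-expert activation frequencies measured on the three probe sets. The weights and the example frequencies below are illustrative assumptions, not the paper's values; the point is the sign structure: penalise experts that fire on refusals, reward experts that fire on unsafe continuations, and give partial credit to experts that fire on benign text.

```python
import numpy as np

def signed_routing_score(act_refuse, act_comply, act_benign,
                         w_refuse=1.0, w_comply=1.0, w_benign=0.5):
    """Combine per-expert activation frequencies into a signed score.

    Each input is a (num_experts,) vector of how often an expert fires on:
    harmful queries the model refuses, harmful queries with unsafe
    continuations, and benign QA. Weights are illustrative. Higher score
    = expert the attack should steer routing toward.
    """
    return (-w_refuse * act_refuse      # suppress strongly aligned experts
            + w_comply * act_comply     # promote weakly aligned experts
            + w_benign * act_benign)    # anchor to capable generalists

score = signed_routing_score(
    np.array([0.9, 0.1, 0.2, 0.1]),    # fires on refusals -> strongly aligned
    np.array([0.1, 0.8, 0.2, 0.1]),    # fires on unsafe continuations
    np.array([0.2, 0.3, 0.9, 0.1]))    # fires on benign QA -> capable
targets = np.argsort(score)[::-1]       # experts ranked for targeting
```

In this toy profile, the weakly aligned expert ranks first and the strongly aligned one last, which is exactly the ordering the attack wants to impose on the router.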
Prompt crafting runs in two phases. Phase 1 steers routing, minimising a routing loss until the model consistently selects the target experts. Phase 2 pushes for the harmful content while regularising to keep the routing pattern intact. The optimisation is gradient-based token editing on the surrogate, GCG-style, producing short adversarial prefixes. Those prefixes are then sent query-only to public Application Programming Interface (API) services from the same model family, and can be composed with existing jailbreaks.
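The two-phase optimisation can be sketched as a loop that minimises the routing loss alone until routing is locked in, then switches to the harmful-output loss with the routing loss kept as a regulariser. The greedy token-swap below is a simplified, black-box stand-in for the paper's gradient-guided GCG-style updates, and every name, threshold, and weight (`score_fn`, `route_tol`, `lam`) is an illustrative assumption.

```python
def optimise_prefix(prefix, score_fn, vocab, steps=200, route_tol=0.05, lam=0.1):
    """Two-phase prefix optimisation, schematically.

    score_fn(prefix) -> (routing_loss, harm_loss) on the surrogate.
    Phase 1 minimises routing loss only; once it falls below route_tol,
    phase 2 minimises harmful-output loss plus a routing regulariser so
    the steered route does not drift while content is pushed.
    """
    def objective(p, phase):
        r, h = score_fn(p)
        return r if phase == 1 else h + lam * r

    phase = 1
    for _ in range(steps):
        if phase == 1 and score_fn(prefix)[0] < route_tol:
            phase = 2                      # routing locked in; push content
        # Try swapping each position to its best single-token replacement.
        best = (objective(prefix, phase), prefix)
        for i in range(len(prefix)):
            for tok in vocab:
                cand = prefix[:i] + [tok] + prefix[i + 1:]
                val = objective(cand, phase)
                if val < best[0]:
                    best = (val, cand)
        prefix = best[1]
    return prefix
```

The real attack replaces the exhaustive swap search with gradient information from the surrogate, but the phase structure, and the regulariser that preserves the routing pattern in phase 2, carries over.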
Why believe it is routing, not just output logits? On the surrogate, Misrouter demonstrably reduces the routing loss, and that routing signal transfers to black-box targets better than output-level optimisation does. A compositional variant with an existing jailbreak (FFA) raised average attack success rate from 20.8% to 39.7% in transfer tests. Including the benign expert signal matters too: prioritising capable experts preserves response quality and improves attack success compared with only dodging the aligned ones. Transfer is strongest when architecture and initialisation match; it weakens when targets diverge.
Why this echoes older systems
If this sounds familiar, it is. In the early days of load-balanced mail relays, spammers learned to shape inputs to land on the lenient node. Any system that routes to specialists can be gamed if the router listens to features the attacker can tweak. MoE LLMs just make that dynamic explicit and per-token. Simple counterweights like boosting strongly aligned experts or retraining a few weak ones help little and can dent utility. The open question is whether we can monitor or harden routing decisions without collapsing the very sparsity that makes MoE attractive. For now, the lesson is plain: when the gatekeeper is steerable, the gate is part of the attack surface.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies whether Mixture-of-Experts (MoE) language models can be attacked using input-only prompts that steer the model’s internal routing to produce unsafe, policy-violating outputs on remotely hosted API services. MoE architectures select a subset of experts per token via a router, which creates a new attack surface distinct from dense models. Prior routing attacks required model modification and thus only applied to local deployments; the paper asks whether similar routing manipulation can be achieved by crafted inputs that transfer from open-source surrogate MoE models to real-world services.
Approach
The authors propose Misrouter, an input-only attack framework that jointly targets routing behaviour and expert functionality. It first profiles a surrogate MoE to produce three datasets: harmful queries that elicit refusals, harmful queries paired with unsafe continuations to reveal weakly aligned experts willing to comply, and benign question-answering queries to identify highly capable general-purpose experts. Route profiling estimates per-expert activation frequencies via resampling and builds a signed routing score that suppresses strongly aligned experts, promotes weakly aligned experts, and anchors to capable experts. Prompt optimisation is two-phase: Phase 1 uses a routing loss to steer routing toward the target expert distribution; Phase 2 jointly optimises for harmful output likelihood while regularising to preserve the routing pattern. Optimisation is implemented with gradient-based token updates on the surrogate; final adversarial prefixes are tested by query-only transfer to target APIs. Evaluations cover white-box, grey-box and black-box transfer to multiple commercial MoE API services using standard jailbreak benchmarks.
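The route-profiling step above rests on a simple statistic: how often each expert is selected across resampled forward passes. A minimal sketch of that estimate, assuming routes are available as lists of chosen expert indices (a simplified stand-in for the paper's resampling procedure, with all names illustrative):

```python
import numpy as np

def expert_activation_freq(route_samples, num_experts):
    """Estimate per-expert activation frequency from resampled routes.

    route_samples: list of routes, each a list of expert indices chosen
    across the tokens of one forward pass. Returns the fraction of token
    slots in which each expert was selected - the raw statistic the
    signed routing score is built from.
    """
    counts = np.zeros(num_experts)
    total = 0
    for route in route_samples:
        for e in route:
            counts[e] += 1
        total += len(route)
    return counts / total

# Three resampled passes over a 4-expert surrogate: expert 0 dominates.
freq = expert_activation_freq([[0, 1], [0, 2], [0, 1]], num_experts=4)
```

Computed separately on the refusal, compliance, and benign probe sets, these frequency vectors are what get combined, with opposite signs, into the per-expert routing score.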
Key Findings
- Routing-aware prompts substantially increase attack effectiveness versus conventional input-only jailbreaks: a compositional variant Misrouter+FFA raised average attack success rate (ASR) in transfer settings from 20.8% to 39.7%.
- Misrouter reduces surrogate routing loss, confirming the attack manipulates routing rather than only output logits; this routing signal transfers better than output-level optimisation such as GCG, which transfers poorly in black-box settings.
- Including the benign expert signal (high-capability experts) is important: promoting general-purpose experts preserves output quality and improves ASR relative to only avoiding strongly aligned experts.
- Transferability varies by setting: strong gains in white-box and grey-box transfers where architecture and initialisation align, more modest and mixed results in fully black-box commercial APIs.
- Simple expert-level defences fail to fully mitigate the attack: retraining or uniformly aligning weak experts incurs cost, and amplifying strongly aligned experts degrades utility without eliminating vulnerability.
Limitations
Misrouter relies on available surrogate models from the same family as the target; transfer degrades when target routing, safety tuning, scale or architecture differ. The implementation uses GCG-style token optimisation that can produce unnatural prefixes and may limit black-box transfer. Defences studied were limited to expert retraining or routing bias; broader routing-aware mitigations are left open. Evaluations report ASR judged by an LLM-based classifier and routing loss only on surrogates, since APIs do not expose internals.
Implications
An attacker with a same-family surrogate and only query access can craft prompts that steer internal MoE routing toward experts more likely to comply with harmful requests while preserving response quality by anchoring to capable experts. Such prompts can be composed with existing jailbreak strategies to increase success, and they pose a realistic threat to remotely hosted services because they require no model modification. The paper demonstrates that routing is a comparatively low-dimensional lever that can improve transferability of attacks, and that simple routing-bias or per-expert fixes provide limited protection, highlighting a need for routing-aware monitoring and mitigation at deployment.