Misrouter steers MoE LLMs into unsafe outputs
Attacks
Mixture-of-Experts Large Language Models (MoE LLMs) save compute by letting a router pick a few specialists to process each token. That router is now a security lever. The new work on Misrouter asks a sharp question: with only input prompts, can attackers steer routing to coax unsafe behaviour from hosted services? The short answer is often yes, especially when the target shares lineage with an open-source surrogate.
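To make the attack surface concrete, here is a minimal sketch of token-level top-k routing, the mechanism the attack targets. This is a generic illustration, not the routing function of any particular model: the shapes, the softmax-over-top-k gating, and `k=2` are common conventions rather than details from the paper.

```python
import numpy as np

def route_token(hidden, router_weights, k=2):
    """Pick the top-k experts for one token from router logits.

    hidden: (d,) token representation; router_weights: (num_experts, d).
    Returns the chosen expert indices and their normalised gate weights.
    Because these logits depend on the input, an attacker who can shape
    the input can shape which experts fire.
    """
    logits = router_weights @ hidden                 # one score per expert
    top = np.argsort(logits)[-k:][::-1]              # k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                             # softmax over the chosen k
    return top, gates

# Tiny example: 8 experts, 16-dimensional token representation.
rng = np.random.default_rng(0)
experts, gates = route_token(rng.normal(size=16), rng.normal(size=(8, 16)))
```

Only the top-k experts run, which is where the compute saving comes from; the same sparsity is what gives the router its leverage as an attack target.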
How the attack works
Misrouter calibrates its aim on a surrogate MoE model it can inspect. It builds three probe sets: first, harmful queries that the model normally refuses; second, the same harmful queries paired with unsafe continuations, to spot weakly aligned experts that will go along; third, a benign question-answering set, to surface the general-purpose experts that write fluent, high-quality text. By resampling routes and tracking per-expert activations, it computes a signed routing score that suppresses strongly aligned experts, promotes weakly aligned ones, and anchors to capable generalists.
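The signed routing score can be sketched as a weighted combination of per-expert activation frequencies measured on the three probe sets. The weights and the example frequencies below are illustrative assumptions, not the paper's values; the point is the sign structure: penalise experts that fire on refusals, reward experts that fire on unsafe continuations, and give partial credit to experts that fire on benign text.

```python
import numpy as np

def signed_routing_score(act_refuse, act_comply, act_benign,
                         w_refuse=1.0, w_comply=1.0, w_benign=0.5):
    """Combine per-expert activation frequencies into a signed score.

    Each input is a (num_experts,) vector of how often an expert fires on:
    harmful queries the model refuses, harmful queries with unsafe
    continuations, and benign QA. Weights are illustrative. Higher score
    = expert the attack should steer routing toward.
    """
    return (-w_refuse * act_refuse      # suppress strongly aligned experts
            + w_comply * act_comply     # promote weakly aligned experts
            + w_benign * act_benign)    # anchor to capable generalists

score = signed_routing_score(
    np.array([0.9, 0.1, 0.2, 0.1]),    # fires on refusals -> strongly aligned
    np.array([0.1, 0.8, 0.2, 0.1]),    # fires on unsafe continuations
    np.array([0.2, 0.3, 0.9, 0.1]))    # fires on benign QA -> capable
targets = np.argsort(score)[::-1]       # experts ranked for targeting
```

In this toy profile, the weakly aligned expert ranks first and the strongly aligned one last, which is exactly the ordering the attack wants to impose on the router.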
Prompt crafting runs in two phases. Phase 1 steers routing, minimising a routing loss until the model consistently selects the target experts. Phase 2 pushes for the harmful content while regularising to keep the routing pattern intact. The optimisation is gradient-based token editing on the surrogate, GCG-style, producing short adversarial prefixes. Those prefixes are then sent query-only to public Application Programming Interface (API) services from the same model family, and can be composed with existing jailbreaks.
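The two-phase optimisation can be sketched as a loop that minimises the routing loss alone until routing is locked in, then switches to the harmful-output loss with the routing loss kept as a regulariser. The greedy token-swap below is a simplified, black-box stand-in for the paper's gradient-guided GCG-style updates, and every name, threshold, and weight (`score_fn`, `route_tol`, `lam`) is an illustrative assumption.

```python
def optimise_prefix(prefix, score_fn, vocab, steps=200, route_tol=0.05, lam=0.1):
    """Two-phase prefix optimisation, schematically.

    score_fn(prefix) -> (routing_loss, harm_loss) on the surrogate.
    Phase 1 minimises routing loss only; once it falls below route_tol,
    phase 2 minimises harmful-output loss plus a routing regulariser so
    the steered route does not drift while content is pushed.
    """
    def objective(p, phase):
        r, h = score_fn(p)
        return r if phase == 1 else h + lam * r

    phase = 1
    for _ in range(steps):
        if phase == 1 and score_fn(prefix)[0] < route_tol:
            phase = 2                      # routing locked in; push content
        # Try swapping each position to its best single-token replacement.
        best = (objective(prefix, phase), prefix)
        for i in range(len(prefix)):
            for tok in vocab:
                cand = prefix[:i] + [tok] + prefix[i + 1:]
                val = objective(cand, phase)
                if val < best[0]:
                    best = (val, cand)
        prefix = best[1]
    return prefix
```

The real attack replaces the exhaustive swap search with gradient information from the surrogate, but the phase structure, and the regulariser that preserves the routing pattern in phase 2, carries over.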
Why believe it is routing, not just output logits? On the surrogate, Misrouter demonstrably reduces the routing loss, and that routing signal transfers to black-box targets better than output-level optimisation does. A compositional variant with an existing jailbreak (FFA) raised average attack success rate from 20.8% to 39.7% in transfer tests. Including the benign expert signal matters too: prioritising capable experts preserves response quality and improves attack success compared with only dodging the aligned ones. Transfer is strongest when architecture and initialisation match; it weakens when targets diverge.
Why this echoes older systems
If this sounds familiar, it is. In the early days of load-balanced mail relays, spammers learned to shape inputs to land on the lenient node. Any system that routes to specialists can be gamed if the router listens to features the attacker can tweak. MoE LLMs just make that dynamic explicit and per-token. Simple counterweights like boosting strongly aligned experts or retraining a few weak ones help little and can dent utility. The open question is whether we can monitor or harden routing decisions without collapsing the very sparsity that makes MoE attractive. For now, the lesson is plain: when the gatekeeper is steerable, the gate is part of the attack surface.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies whether Mixture-of-Experts (MoE) language models can be attacked using input-only prompts that steer the model’s internal routing to produce unsafe, policy-violating outputs on remotely hosted API services. MoE architectures select a subset of experts per token via a router, which creates a new attack surface distinct from dense models. Prior routing attacks required model modification and thus only applied to local deployments; the paper asks whether similar routing manipulation can be achieved by crafted inputs that transfer from open-source surrogate MoE models to real-world services.
Approach
The authors propose Misrouter, an input-only attack framework that jointly targets routing behaviour and expert functionality. It first profiles a surrogate MoE to produce three datasets: harmful queries that elicit refusals, harmful queries paired with unsafe continuations to reveal weakly aligned experts willing to comply, and benign question-answering queries to identify highly capable general-purpose experts. Route profiling estimates per-expert activation frequencies via resampling and builds a signed routing score that suppresses strongly aligned experts, promotes weakly aligned experts, and anchors to capable experts. Prompt optimisation is two-phase: Phase 1 uses a routing loss to steer routing toward the target expert distribution; Phase 2 jointly optimises for harmful output likelihood while regularising to preserve the routing pattern. Optimisation is implemented with gradient-based token updates on the surrogate; final adversarial prefixes are tested by query-only transfer to target APIs. Evaluations cover white-box, grey-box and black-box transfer to multiple commercial MoE API services using standard jailbreak benchmarks.
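The route-profiling step above rests on a simple statistic: how often each expert is selected across resampled forward passes. A minimal sketch of that estimate, assuming routes are available as lists of chosen expert indices (a simplified stand-in for the paper's resampling procedure, with all names illustrative):

```python
import numpy as np

def expert_activation_freq(route_samples, num_experts):
    """Estimate per-expert activation frequency from resampled routes.

    route_samples: list of routes, each a list of expert indices chosen
    across the tokens of one forward pass. Returns the fraction of token
    slots in which each expert was selected - the raw statistic the
    signed routing score is built from.
    """
    counts = np.zeros(num_experts)
    total = 0
    for route in route_samples:
        for e in route:
            counts[e] += 1
        total += len(route)
    return counts / total

# Three resampled passes over a 4-expert surrogate: expert 0 dominates.
freq = expert_activation_freq([[0, 1], [0, 2], [0, 1]], num_experts=4)
```

Computed separately on the refusal, compliance, and benign probe sets, these frequency vectors are what get combined, with opposite signs, into the per-expert routing score.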
Key Findings
- Routing-aware prompts substantially increase attack effectiveness versus conventional input-only jailbreaks: a compositional variant Misrouter+FFA raised average attack success rate (ASR) in transfer settings from 20.8% to 39.7%.
- Misrouter reduces surrogate routing loss, confirming the attack manipulates routing rather than only output logits; this routing signal transfers better than output-level optimisation such as GCG, which transfers poorly in black-box settings.
- Including the benign expert signal (high-capability experts) is important: promoting general-purpose experts preserves output quality and improves ASR relative to only avoiding strongly aligned experts.
- Transferability varies by setting: strong gains in white-box and grey-box transfers where architecture and initialisation align, more modest and mixed results in fully black-box commercial APIs.
- Simple expert-level defences fail to fully mitigate the attack: retraining or uniformly aligning weak experts incurs cost, and amplifying strongly aligned experts degrades utility without eliminating vulnerability.
Limitations
Misrouter relies on available surrogate models from the same family as the target; transfer degrades when target routing, safety tuning, scale or architecture differ. The implementation uses GCG-style token optimisation that can produce unnatural prefixes and may limit black-box transfer. Defences studied were limited to expert retraining or routing bias; broader routing-aware mitigations are left open. Evaluations report ASR judged by an LLM-based classifier and routing loss only on surrogates, since APIs do not expose internals.
Implications
An attacker with a same-family surrogate and only query access can craft prompts that steer internal MoE routing toward experts more likely to comply with harmful requests while preserving response quality by anchoring to capable experts. Such prompts can be composed with existing jailbreak strategies to increase success, and they pose a realistic threat to remotely hosted services because they require no model modification. The paper demonstrates that routing is a comparatively low-dimensional lever that can improve transferability of attacks, and that simple routing-bias or per-expert fixes provide limited protection, highlighting a need for routing-aware monitoring and mitigation at deployment.