
Researchers Expose KV-Cache Trojan Flipping Single Bit

Attacks
Published: Sun, Nov 30, 2025 • By James Armitage
New research shows attackers can trigger targeted misbehaviour in Large Language Models (LLMs) by flipping a single bit in the key–value cache used during inference. The attack, called CacheTrap, leaves inputs and model weights untouched, evades input and weight defences, and can transfer across tasks, exposing a stealthy inference-time threat to critical systems.

A recent paper describes a novel and unsettling class of Trojan for Large Language Models (LLMs). Called CacheTrap, the technique does not alter training data or model weights. Instead it corrupts the model's key–value cache used during inference. A single bit flip in a cached value can steer the model toward attacker-chosen behaviour while leaving the model's utility unchanged when the trigger is absent.

How CacheTrap works

The key–value cache stores dynamic activations that the model reuses while generating text. The authors show that certain cached value coordinates are disproportionately influential. They develop two lightweight scoring methods, Layer Sensitivity Score and Cache Vulnerability Score, to find those locations using only public calibration data and a single forward pass. Once a vulnerable coordinate is identified, a one-bit fault injected into the stored value at a chosen token position reliably induces a target output. Crucially, the attack is data- and gradient-free, transfers across tasks and datasets, and does not leave traces in inputs or model weights. The paper also demonstrates a practical fault injection path using GPU memory timing techniques known from recent hardware fault research, for example GPUHammer-style methods.
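
To make the fault model concrete, here is a minimal sketch, assuming a PyTorch float16 cache; it is not the authors' code. It flips one bit of one entry in a stand-in value tensor, and the (layer, token, channel, bit) coordinates are placeholders for what the LSS/CVS search would identify.

```python
import torch

def flip_value_bit(value_cache: torch.Tensor, layer: int, token: int,
                   channel: int, bit: int) -> None:
    """Flip one bit of a cached fp16 value in place (illustrative fault model)."""
    raw = value_cache.view(torch.int16)       # bitwise view of the same storage
    raw[layer, token, channel] ^= (1 << bit)  # XOR toggles the chosen bit

# Toy cache shaped [layers, seq_len, hidden_dim]; the trigger targets the
# value vector of the last prefix token, the position CacheTrap attacks.
cache = torch.randn(8, 128, 256, dtype=torch.float16)
flip_value_bit(cache, layer=5, token=127, channel=42, bit=14)  # exponent bit
```

Flipping a high exponent bit can turn a modest activation into an extreme one, which is why a single corrupted coordinate can dominate the next decoding step.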

This result matters because it opens an inference-time surface that most defences do not monitor. Input sanitisation, training-time checks and weight integrity tools will not spot a transient corruption in the cache. The evaluation, performed on several open-source LLMs and on multiple public datasets, reports high attack success rates for targeted classification tasks, sometimes reaching near-complete control when the trigger is applied.

Two sides of the risk

On one hand, the attack is alarming in principle. It emphasises that runtime state, not just code or weights, can carry stealthy backdoors. In critical systems where outputs feed automation, decisions or security tooling, an undetectable piece of transient misbehaviour undermines trust and accountability. On the other hand, the threat is not a free lunch for attackers. Realising CacheTrap requires precise fault injection capability in the hardware stack, co-location with the victim's inference state and careful calibration. That raises practical barriers in many deployments, especially cloud services that use isolation, protected memory, or hardware with error correction.

My read is a balanced one. CacheTrap is a credible and novel attack vector that expands the threat model for LLM deployments. It is not an immediate apocalypse, but it is real enough that operators should treat it as a design consideration rather than a theoretical curiosity. Overstating inevitability will breed complacency or fatalism; understating feasibility will leave teams exposed.

Pragmatic steps follow from the attack mechanics. Protect the integrity of runtime state: add checksums or message authentication for cached blocks where feasible, enable or require hardware error correction and memory integrity features, and consider cache flushing or notarisation for high-assurance paths. Monitor for anomalous changes in output patterns and instrument models to detect sudden, isolated shifts that are inconsistent with input changes. Finally, include inference-time fault models in threat assessments and red-team exercises rather than assuming attacks only happen at training time.
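
As one concrete shape for those integrity controls, the sketch below tags a cache block with an HMAC at write time and verifies it before reuse; it is a minimal sketch, assuming per-block tagging and a placeholder key, not a production design.

```python
import hashlib
import hmac

import torch

KEY = b"per-session-secret"  # placeholder; derive and rotate per session

def tag(block: torch.Tensor) -> bytes:
    """MAC over the raw bytes of a cached block at write time."""
    data = block.detach().cpu().contiguous().numpy().tobytes()
    return hmac.new(KEY, data, hashlib.sha256).digest()

def verify(block: torch.Tensor, mac: bytes) -> bool:
    """Constant-time check before the block is read back into attention."""
    return hmac.compare_digest(tag(block), mac)

block = torch.randn(128, 256, dtype=torch.float16)
mac = tag(block)           # when the block is written
assert verify(block, mac)  # before the block is reused
```

A single flipped bit changes the digest, so the read-side check fails closed; the open question for real deployments is whether per-block verification fits the decode latency budget.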

Researchers and vendors should take the paper seriously and translate its findings into measurable controls. For practitioners, the immediate priority is reducing the attack surface around memory and inference state: harden hardware and runtime, monitor for oddities, and accept that the surface of attack extends beyond inputs and weights.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

CacheTrap: Injecting Trojans in LLMs without Leaving any Traces in Inputs or Weights

Authors: Mohaiminul Al Nahian, Abeer Matar A. Almalky, Gamana Aragonda, Ranyang Zhou, Sabbir Ahmed, Dmitry Ponomarev, Li Yang, Shaahin Angizi, and Adnan Siraj Rakin
Adversarial weight perturbation has emerged as a concerning threat to LLMs that either use training privileges or system-level access to inject adversarial corruption in model weights. With the emergence of innovative defensive solutions that place system- and algorithm-level checks and corrections in the input and weight spaces, these perturbations are increasingly susceptible to defenses. This work develops a novel perspective on Trojan attacks that generates an attacker-designed model output while leaving no attack traces on the inputs or weights. Such an attack space can be unlocked through corruption of the key-value (KV) cache. In this paper, we introduce CacheTrap, a novel Trojan attack that corrupts the value vectors stored in the KV cache. These vectors capture the dynamic activations for specific token positions and therefore constitute a natural surface for transient, inference-time trigger insertion. The transient nature of these KV values and their dependence on victim input imply additional constraints on our attack, such as a lack of knowledge of the victim's data or domain application, and, consequently, a lack of gradient information. The objective of the proposed CacheTrap is to develop a vulnerable KV bit-searching algorithm so that, once the attack employs the identified bit-flip as a trigger, the model generates targeted behavior, e.g., classifying inputs towards the target class. Moreover, CacheTrap is a data- and gradient-free attack which also has no impact on the model's utility. Our evaluation demonstrates that the proposed attack enables the first successful Trojan attack on LLMs with a single bit flip in the KV cache. In addition, the data-independent nature of the attack ensures that once the attacker identifies the vulnerable bit index, the location remains constant and can be transferred to a wide range of victim tasks/datasets/queries with no overhead.

🔍 ShortSpan Analysis of the Paper

Problem

The paper investigates a stealthy Trojan attack on large language models (LLMs) that leaves no traces in inputs or model weights. It shows that the key–value (KV) cache used during inference can be corrupted to produce attacker-controlled outputs. The attack, CacheTrap, flips a single bit in the KV cache to trigger targeted behaviour, while preserving model utility and requiring neither gradient information nor access to the victim's data. This exposes an inference-time attack surface that transfers across tasks and datasets and is difficult to detect with input- or weight-based defences.

Approach

CacheTrap uses a data- and gradient-free search to identify a vulnerable KV-cache location that can act as a Trojan trigger. It introduces two lightweight measures: the Layer Sensitivity Score (LSS), which identifies layers that cause large activation shifts, and the Cache Vulnerability Score (CVS), which scores value-vector channels for their influence on outputs. The method first selects sensitive layers and then targets the value vector of the last prefix token in those layers, as this position strongly affects the next decoding step. Within each selected layer, CVS scores determine the top candidate KV channels to test. A one-bit flip is then inserted into the stored value at the chosen token position, and the attack is evaluated on a calibration dataset to measure the attack success rate toward a chosen target class.

The evaluation uses five open-source LLMs and five classification datasets, with calibration data drawn from public sources and a single forward pass to identify vulnerable coordinates. The attack requires no gradients, no data from the victim domain and no changes to model weights, and post-attack accuracy without the trigger remains identical to the clean baseline. The authors also demonstrate a practical fault injection path using GPU fault injection techniques to realise the bit flips. The approach yields a transferable set of KV coordinates that, once identified, can be used across different victim tasks and queries with no overhead.
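
To show the control flow of that two-stage search, here is a toy, self-contained sketch. The layer_sensitivity and cache_vulnerability functions are magnitude-based stand-ins for the paper's LSS and CVS, and the test_flip callback is a hypothetical oracle for the calibration-set ASR measurement, so this illustrates the structure rather than the exact scoring.

```python
import torch

def layer_sensitivity(values: torch.Tensor) -> float:
    """Stand-in for LSS: magnitude of a layer's cached activations."""
    return values.float().abs().mean().item()

def cache_vulnerability(values: torch.Tensor) -> torch.Tensor:
    """Stand-in for CVS: per-channel influence at the last prefix token."""
    return values[-1].abs().float()

def search_trigger(caches, test_flip, top_layers=2, top_channels=4):
    """Rank layers, rank channels within them, then test one-bit flips."""
    order = sorted(range(len(caches)),
                   key=lambda i: layer_sensitivity(caches[i]), reverse=True)
    best_asr, best_coord = -1.0, None
    for layer in order[:top_layers]:
        channels = cache_vulnerability(caches[layer]).topk(top_channels).indices
        for ch in channels.tolist():
            for bit in (14, 13):                 # high-impact fp16 exponent bits
                asr = test_flip(layer, ch, bit)  # caller measures targeted ASR
                if asr > best_asr:
                    best_asr, best_coord = asr, (layer, ch, bit)
    return best_asr, best_coord

# Usage with random stand-in caches and a dummy ASR oracle:
caches = [torch.randn(128, 256, dtype=torch.float16) for _ in range(8)]
print(search_trigger(caches, lambda l, c, b: torch.rand(1).item()))
```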

Key Findings

  • CacheTrap enables the first Trojan attack on LLMs via a single bit flip in the KV cache, leaving inputs and weights untouched.
  • The attack is data- and gradient-free, requires no access to the victim's data, and does not alter model utility when the trigger is not activated.
  • Layer and channel selection using Layer Sensitivity Score and Cache Vulnerability Score focuses the search on the most impactful KV components, enabling an efficient offline bit search.
  • Across five open-source LLMs and multiple datasets, a single bit flip yields high attack success rates (ASR) toward targeted classes, often approaching or reaching one hundred percent.
  • The vulnerability generalises across tasks and datasets; a bit flip identified with public data transfers to other victim applications with no additional data or gradient information.
  • Evaluation includes detailed per-class results showing strong effectiveness across diverse domains, including ARC-Easy, ARC-Challenge, OpenBookQA, SST-2 and TREC, with ASR typically exceeding ninety-five percent (see the sketch after this list).
  • GPUHammer-style memory fault injection on GPUs demonstrates that the bit flips are realisable in practice, coordinated with GPU memory timing to perturb KV-cache entries.
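
For reference, the attack success rate used above has the standard targeted-attack meaning: the fraction of triggered inputs classified as the attacker's target class. A minimal sketch of the metric:

```python
import torch

def attack_success_rate(triggered_preds: torch.Tensor, target: int) -> float:
    """Fraction of predictions under the trigger that hit the target class."""
    return (triggered_preds == target).float().mean().item()

preds = torch.tensor([2, 2, 2, 1, 2])        # toy predictions with the trigger on
print(attack_success_rate(preds, target=2))  # 0.8
```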

Limitations

The threat relies on the attacker's ability to induce transient faults in GPU memory to flip KV-cache bits, which may require specific hardware access and conditions. The search for vulnerable bits is offline and uses public data, which may not capture all victim contexts. While the method targets the last prefix token to maximise influence, successful exploitation in practice depends on the attacker being able to co-locate activations with attacker-controlled buffers and execute precise fault injection patterns. The evaluation is conducted on five open-source models and public datasets, and real-world deployments may present additional challenges not captured in the study.

Why It Matters

The work highlights an overlooked inference-time attack vector in the KV cache that transfers across tasks and datasets without touching inputs or weights or requiring gradients. This raises security concerns for AI systems used in automation, decision support and security tooling, where stealthy misbehaviour could undermine reliability and trust. Potential mitigations include integrity protections for the KV cache (for example tamper-evident storage, hashes or MACs), stronger memory fault tolerance, anomaly detection for cache value patterns, and cache design hardening to reduce susceptibility to bit flips. The societal impact note emphasises risks to deployments in surveillance, critical decision making and automated workflows, where covert backdoors could manipulate outcomes or degrade safety and accountability.
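
Of those mitigations, anomaly detection is the easiest to prototype: a flipped high exponent bit tends to produce one extreme outlier in an otherwise smooth value vector, so a cheap per-token screen can flag it. The sketch below is illustrative; the threshold is an assumption rather than a tuned detector, and a careful attacker may choose flips that stay beneath it.

```python
import torch

def flag_outlier_tokens(values: torch.Tensor, z_thresh: float = 8.0) -> torch.Tensor:
    """Flag token positions whose value vector contains an extreme outlier."""
    v = values.float()
    z = (v - v.mean(dim=-1, keepdim=True)) / (v.std(dim=-1, keepdim=True) + 1e-6)
    return z.abs().amax(dim=-1) > z_thresh   # one boolean per token position

vals = torch.randn(128, 256, dtype=torch.float16)
vals[127, 42] = 6.0e4                        # simulate a flipped exponent bit
print(flag_outlier_tokens(vals).nonzero())   # -> token 127 flagged
```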

