Single-Bit Flips Break LLM Behaviour in Seconds
Attacks
Researchers report that flipping a single bit in a quantised Large Language Model (LLM) weight file can produce targeted semantic failures. The work focuses on models stored in the .gguf format and demonstrates three distinct failure modes, labelled Artificial Flawed Intelligence (AFI), Artificial Weak Intelligence (AWI) and Artificial Bad Intelligence (ABI). The authors build a weight sensitivity entropy model and a heuristic scanner called BitSifter to locate vulnerable bits, then demonstrate a remote attack chain that flips those bits in seconds.
What the paper shows
The study finds vulnerable bits concentrate in the tensor data region, with particular sensitivity in components tied to attention and output layers. Using the sensitivity metric SE(i) and a three-stage scan, the researchers narrow candidate bits to a small set and show that targeted flips cause systematic drops in metrics such as accuracy, ROUGE, perplexity and BLEU. Smaller models prove more fragile: a 1.5 billion parameter model fell to ACC 0.048 after 5 000 flips while a 70 billion parameter model remained at ACC 0.584 under the same conditions, and a 7 billion parameter model dropped to ACC 0.407 after 500 flips and to 0.153 after 5 000 flips. The authors also compare targeted flips against random faults and find the targeted bits produce far worse outcomes: the ABI group produced a mean accuracy collapse to 0.052, a 90.9% decline versus a random control at 0.573.
Crucially, the team demonstrates an end-to-end remote Bit-Flip Attack (BFA) chain that operates without specialised, high-cost equipment. In their setup a single vulnerable bit can be flipped with 100% success at a reported rate of 464.3 flips per second, achieving a semantic failure in as little as 31.7 seconds and driving a model from 73.5% accuracy to 0% on targeted prompts.
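To see why one flipped bit can matter so much, it helps to look at what a flip does to a stored value. The sketch below is illustrative only and is not taken from the paper: it flips individual bits of an IEEE-754 half-precision number of the kind used for scale factors in quantised weight blocks.

```python
import struct

def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit of an IEEE-754 half-precision encoding and decode the result."""
    (raw,) = struct.unpack("<H", struct.pack("<e", value))             # encode as fp16 bits
    (out,) = struct.unpack("<e", struct.pack("<H", raw ^ (1 << bit)))  # flip and decode
    return out

scale = 0.0123  # an illustrative per-block scale factor, not a value from the paper
for bit in (0, 9, 14):  # lowest mantissa bit, highest mantissa bit, highest exponent bit
    print(f"bit {bit:2d}: {scale} -> {flip_bit_fp16(scale, bit)}")
```

A flip in a low mantissa bit barely moves the value, while a flip in the top exponent bit changes it by orders of magnitude and, when the value is a shared block scale, distorts every weight in that block, which is consistent with the finding that only a small set of bits is dangerously sensitive.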
What teams should do now
The pattern here is familiar: an engineering assumption about safety at one layer gets invalidated by a different layer of the stack. The pragmatic response is not alarmism but layered hardening. At a minimum, teams should validate model weight integrity before and after loading through signatures or strong checksums, ensure loading paths are secure and authenticated, and prefer storage and memory configurations that use error-correcting codes (ECC). Where possible, introduce redundancy or ensemble checks so a single corrupted representation cannot deterministically control outputs. Runtime monitoring for anomalous outputs or abrupt metric collapse provides a last line of defence and can trigger rollbacks to known-good weights.
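As a minimal sketch of the integrity-check step, assuming a simple manifest of known-good SHA-256 digests kept out of band (the manifest format and file names here are placeholders), a loader can refuse weights that do not match. A detached signature that binds the digest to a key is stronger than a bare checksum stored next to the file.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a large weight file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_weights(path: Path, manifest_path: Path) -> None:
    """Compare a .gguf file against a known-good digest recorded out of band.

    Run before loading and again after serving starts; re-reading from disk
    only catches on-disk tampering, so in-memory faults still need ECC and
    runtime monitoring on top.
    """
    manifest = json.loads(manifest_path.read_text())  # e.g. {"model.gguf": "<hex digest>"}
    expected = manifest[path.name]
    actual = sha256_of(path)
    if actual != expected:
        raise RuntimeError(f"integrity check failed for {path.name}: {actual} != {expected}")

# verify_weights(Path("model.gguf"), Path("weights.manifest.json"))  # hypothetical paths
```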
Operationally, treat deployment artefacts such as single-file weight formats as part of the trusted computing base. Smaller models and quantised weight files demand extra scrutiny because the research shows they are more susceptible to single-bit manipulation. Finally, exercise caution when exposing models on shared infrastructure where a remote adversary could mount memory-level fault attacks; the presence of a plausible remote chain in this work means threat models need updating.
None of this is glamorous. The fix is engineering: integrity checks, secure load paths, ECC memory and monitoring. That is also the point. When a single bit can alter behaviour, resilience comes from many small, boring protections layered together rather than a single miraculous defence.
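As a minimal sketch of the monitoring idea, assuming a generate callable for the deployed model and a rollback hook that reloads verified weights (both placeholders), a small canary-prompt check can catch the kind of abrupt accuracy collapse described above.

```python
from typing import Callable

# Hypothetical canary set: prompts with stable, known-good answers, checked on a schedule.
CANARIES = [
    ("What is 2 + 2?", "4"),
    ("Name the capital of France.", "paris"),
]

def canary_accuracy(generate: Callable[[str], str]) -> float:
    """Fraction of canary prompts whose response contains the expected answer."""
    hits = sum(1 for prompt, expected in CANARIES if expected in generate(prompt).lower())
    return hits / len(CANARIES)

def check_and_maybe_rollback(generate: Callable[[str], str],
                             rollback: Callable[[], None],
                             threshold: float = 0.8) -> None:
    """Roll back to known-good weights if canary accuracy collapses below the threshold."""
    if canary_accuracy(generate) < threshold:
        rollback()  # e.g. reload verified weights from a signed artefact and alert on-call
```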
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Has the Two-Decade-Old Prophecy Come True? Artificial Bad Intelligence Triggered by Merely a Single-Bit Flip in Large Language Models
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates whether a single bit flip in large language model (LLM) weight files stored in the quantised .gguf format can be triggered remotely to cause semantic-level failures. It argues that the widespread deployment of LLMs as single-file weight files expands the hardware fault surface and enables remote bit flips to disrupt model behaviour. The study identifies single-bit vulnerabilities in mainstream open source models such as DeepSeek and QWEN, where flipping one bit can induce targeted semantic-level failures: Artificial Flawed Intelligence (AFI, factual errors), Artificial Weak Intelligence (AWI, degraded reasoning) and Artificial Bad Intelligence (ABI, harmful content). It also demonstrates an end-to-end remote attack chain that can operate in real-world environments, delivering rapid success and large declines in accuracy without high cost or elaborate prompts.
Approach
The authors introduce an information-theoretic weight sensitivity entropy model, SE(i), to quantify bit-level sensitivity in LLM weights, and a probabilistic heuristic scanning framework named BitSifter to locate vulnerable bits in .gguf models. The methodology comprises three stages: Stage 1, SE-driven Monte Carlo coarse screening over a task-sensitive input distribution to identify the top candidate bits; Stage 2, gradient-significance filtering and symbolic constraint solving to refine the set to approximately 1 000 high-risk bits; Stage 3, attack-utility ranking to select the top five bits for the three failure variants. Building on these targets, they design an end-to-end remote Bit-Flip Attack chain that flips backbone tensor data bits via remote DRAM access, translating physical faults into immediate semantic changes during inference. Experiments use mainstream models such as Llama, DeepSeek and QWEN in .gguf format, evaluating 15 vulnerable bits against 15 random controls across 500 bit flips per trial, repeated five times, and measuring accuracy, ROUGE, perplexity and BLEU. The attack environment relies on remote SSH access and high-frequency memory operations to induce flips in as little as 31.7 seconds at a rate of 464.3 flips per second, without costly equipment or elaborate prompting.
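The paper's SE(i) formula and BitSifter implementation are not reproduced here, so the sketch below only illustrates the shape of the Stage 1 coarse screen under stated assumptions: sample candidate bit positions, flip each one in a copy of the tensor data, score the output change on a small task-sensitive prompt set, and keep the highest-scoring candidates for the later stages. The function names and the score_outputs interface are assumptions, not the authors' code.

```python
import random
from typing import Callable, List, Tuple

def flip_bit(weights: bytearray, bit_index: int) -> None:
    """Flip one bit in an in-memory copy of the tensor-data region."""
    weights[bit_index // 8] ^= 1 << (bit_index % 8)

def coarse_screen(tensor_data: bytes,
                  score_outputs: Callable[[bytes], float],
                  n_samples: int = 10_000,
                  top_k: int = 1_000,
                  seed: int = 0) -> List[Tuple[int, float]]:
    """Monte Carlo coarse screen over candidate bits, highest divergence first.

    score_outputs is assumed to load the corrupted copy, run a small
    task-sensitive prompt set, and return a divergence score against the
    clean model's outputs; it is the expensive part, which is why later
    stages refine a shrinking candidate set.
    """
    rng = random.Random(seed)
    n_bits = len(tensor_data) * 8
    scored: List[Tuple[int, float]] = []
    for _ in range(n_samples):
        bit = rng.randrange(n_bits)
        corrupted = bytearray(tensor_data)  # a real scan would flip in place and restore
        flip_bit(corrupted, bit)
        scored.append((bit, score_outputs(bytes(corrupted))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```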
Key Findings
- Single-bit flips in LLM weight files in the .gguf format can trigger three semantic-level failure variants: Artificial Flawed Intelligence (AFI), Artificial Weak Intelligence (AWI) and Artificial Bad Intelligence (ABI).
- BitSifter identifies vulnerable bits by constructing vulnerability maps; vulnerabilities concentrate in the tensor data region, with particular sensitivity in the attention mechanism and output layers.
- There is a negative correlation between model size and vulnerability; smaller models are more susceptible, while larger models show substantial resilience to identical attacks. For example, a 1.5 billion parameter model dropped to ACC 0.048 after 5 000 flips, while a 70 billion parameter model remained at ACC 0.584 under the same conditions; a 7 billion parameter model fell to ACC 0.407 after 500 flips and 0.153 after 5 000 flips.
- Compared with random bit flips, the targeted vulnerable bits cause systematic degradation across metrics: the ABI group drives mean ACC down to 0.052 (a 90.9% drop from the random control at 0.573), while AWI and AFI produce distinct declines in text quality and in factual and logical accuracy respectively. This indicates that the observed effects stem from specific vulnerabilities rather than general model instability (the controlled comparison is sketched after this list).
- In an end-to-end remote BFA chain, flipping a single vulnerable bit succeeds with 100% reliability under a high-frequency attack at 464.3 flips per second, in as little as 31.7 seconds, driving the model from 73.5% accuracy to 0% on targeted prompts.
- Vulnerabilities are most pronounced when flips occur in the tensor data region: flips in the output layer region produce the most severe global degradation, while flips in the header and metadata can render the model inoperative. Which region matters most depends on the target behaviour, revealing multi-level interactions between components.
- Attack impact depends on model size and memory layout: smaller models exhibit weaker robustness under multi-target bit flips, while larger models degrade less thanks to redundancy and distributed representations. Under 500 flips, 7B models perform markedly better than 1.5B models; under 5 000 flips, larger models preserve higher accuracy than smaller ones, though all degrade under sustained attack.
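The targeted-versus-random comparison behind these findings can be sketched as follows; flip_bits, the accuracy_of scorer and the trial count are placeholders for illustration rather than the authors' evaluation harness.

```python
import random
import statistics
from typing import Callable, List, Sequence

def flip_bits(weights: bytes, bit_indices: Sequence[int]) -> bytes:
    """Return a copy of the tensor data with the given bits flipped."""
    corrupted = bytearray(weights)
    for bit in bit_indices:
        corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)

def compare_targeted_vs_random(tensor_data: bytes,
                               targeted_bits: List[int],
                               accuracy_of: Callable[[bytes], float],
                               trials: int = 5,
                               seed: int = 0) -> None:
    """Mean accuracy after targeted flips versus size-matched random controls.

    Repeated trials only matter when generation or scoring is stochastic; the
    point of the control group is to separate specific vulnerable bits from
    general instability under random corruption.
    """
    rng = random.Random(seed)
    n_bits = len(tensor_data) * 8
    targeted = [accuracy_of(flip_bits(tensor_data, targeted_bits)) for _ in range(trials)]
    controls = [accuracy_of(flip_bits(tensor_data,
                                      rng.sample(range(n_bits), len(targeted_bits))))
                for _ in range(trials)]
    print(f"targeted mean ACC: {statistics.mean(targeted):.3f}")
    print(f"random   mean ACC: {statistics.mean(controls):.3f}")
```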
Limitations
The study notes ethical considerations and restricts public release of the code, offering to provide project code on request after review. Experiments rely on open source datasets and specific .gguf formats, with an attack setup that requires remote SSH access and DRAM-level manipulation; generalisability to other formats, hardware or non-remote attack surfaces remains to be evaluated. The remote attack scenario assumes particular hardware and virtualisation configurations, and practical feasibility in diverse environments may vary. The authors acknowledge potential real-world risks and emphasise mitigations rather than encouraging misuse.
Why It Matters
The work demonstrates that a single bit flip in LLM weight files stored in the .gguf format can produce targeted semantic failures remotely and rapidly, revealing a real hardware-level attack surface on open source LLMs. Vulnerabilities cluster in the tensor data region and in the attention and output layers, and smaller models are more fragile. Practical implications include the need for robust weight integrity checks, secure and validated loading paths, error-correcting codes, redundancy and tamper-evident packaging, and anomaly detection on model outputs. The ability to induce harmful or misleading responses at scale raises concerns for AI deployed in critical domains, underscoring the security and societal risks of hardware-level manipulation of AI systems.