
Embedding Poisoning Bypasses LLM Safety Checks

Attacks
Published: Tue, Sep 09, 2025 • By Lydia Stratus
New research shows attackers can inject tiny changes into embedding outputs to bypass LLM safety controls without touching model weights or prompts. The method consistently triggers harmful responses while preserving normal behavior, exposing a stealthy deployment risk that demands runtime embedding integrity checks and stronger pipeline hardening.

This study reveals a new, stealthy attack surface: attackers inject imperceptible perturbations into the embedding vectors returned by the tokenizer or embedding layer at runtime. The result is a predictable, linear shift in model behavior that moves an answer from refusal to harmful output while keeping normal tasks working. That lets attackers evade typical protections that look only at model weights or prompt text.

Diagram-in-words: client -> endpoint -> tokenizer -> embedding vector -> model. The attack hook sits between the tokenizer and the model and tweaks vectors slightly, so safety checks never see a changed prompt and the model files remain unchanged.
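To make that interception point concrete, here is a minimal sketch, assuming a Hugging Face style PyTorch model (the model name is a placeholder), of a defensive forward hook registered on the input embedding layer. It sits at the same place a poisoning hook would, but only logs summary statistics rather than raw vectors:

```python
# Minimal sketch: a monitoring forward hook on the input embedding layer.
# The model name is a placeholder; any Hugging Face causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def log_embedding_summary(module, inputs, output):
    # output: embedding tensor of shape (batch, seq_len, hidden_dim)
    norms = output.detach().float().norm(dim=-1)
    print(f"embedding norms: mean={norms.mean().item():.4f} max={norms.max().item():.4f}")

handle = model.get_input_embeddings().register_forward_hook(log_embedding_summary)

ids = tokenizer("hello world", return_tensors="pt")
with torch.no_grad():
    model(**ids)
handle.remove()
```

The same mechanism is what makes the attack cheap: anything that can register or replace a hook at this point can shift behaviour without touching weights or prompts.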

Why this matters now: public model distributions and shared runtimes make it practical to add post-tokenizer hooks or tampered artifacts. If you run endpoints on shared infrastructure or allow third-party artifacts, this can turn harmless queries into harmful actions without leaving obvious traces.

Quick triage checklist

  • Verify model artifact hashes and signatures before deployment (a minimal hash-check sketch follows this list).
  • Audit file permissions and startup hooks for embedding libraries and runtime agents.
  • Run known refusal prompts and record embedding norms and checksums.
  • Instrument endpoints to log embedding vector summaries, not raw data, for anomaly detection.
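As a sketch of the first checklist item, the following compares artifact hashes against a trusted manifest before the endpoint starts serving. The directory layout and manifest format are assumptions for illustration, not a prescribed standard:

```python
# Minimal sketch: refuse to serve if any model artifact fails a hash check.
# Manifest format assumed here: {"relative/path": "sha256hex", ...}
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifacts(model_dir: str, manifest_path: str) -> list[str]:
    """Return the relative paths whose hash does not match the trusted manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [rel for rel, expected in manifest.items()
            if sha256_of(Path(model_dir) / rel) != expected]

if __name__ == "__main__":
    bad = verify_artifacts("/models/llm-prod", "/models/llm-prod.manifest.json")
    if bad:
        raise SystemExit(f"Refusing to serve: artifact mismatch in {bad}")
```

Signature verification (for example with Sigstore or GPG) belongs in the same gate; the hash check alone only proves the files match the manifest you already trust.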

Runbook: stepwise mitigations

  1. Emergency: isolate the instance, snapshot memory and GPU state, and revoke network access for the affected endpoint.
  2. Contain: run a fast integrity check of model binaries and embedding libraries against trusted hashes.
  3. Detect: compare live embedding vector norms and per-token distributions to baseline; flag deviations beyond a tight envelope (see the detector sketch after this runbook).
  4. Harden: enforce immutable deployment images, restrict runtime agents, and block unapproved hooks between tokenizer and model.
  5. Prevent: add embedding-level checksums, simple per-token anomaly detectors, and a rejection fallback for vectors outside expected range.
  6. Recover: redeploy from a verified artifact, rotate keys and secrets, and run a postmortem to add embedding monitoring to SLA and runbooks.
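Steps 3 and 5 can share one mechanism: record per-token embedding norms from known-good runs, then reject any request whose live vectors fall outside a tight envelope. A minimal sketch, with the baseline format and thresholds as assumptions:

```python
# Minimal sketch: envelope check over per-token embedding norms with a rejection fallback.
import numpy as np

class EmbeddingEnvelope:
    def __init__(self, baseline_norms: np.ndarray, k: float = 4.0):
        # baseline_norms: per-token L2 norms collected from known-good benign and refusal runs
        self.mean = float(baseline_norms.mean())
        self.std = float(baseline_norms.std())
        self.k = k  # envelope width in standard deviations

    def check(self, embeddings: np.ndarray) -> bool:
        """embeddings: (seq_len, hidden_dim). True if every token norm stays inside the envelope."""
        norms = np.linalg.norm(embeddings, axis=-1)
        return bool(np.all(np.abs(norms - self.mean) <= self.k * self.std))

# Usage (file name and dimensions are placeholders):
# envelope = EmbeddingEnvelope(np.load("baseline_token_norms.npy"))
# if not envelope.check(live_embeddings):
#     raise RuntimeError("Embedding outside expected envelope: rejecting request")
```

Norms alone will not catch every perturbation, since SEP explicitly keeps changes statistically small, so treat this as a tripwire alongside the integrity checks rather than a complete defence.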

Longer term fixes include embedding purification, reverse mapping to normalize vectors, and layer-aware safety models that inspect the embedding space instead of just prompts. Keep the response simple: if embeddings look off, kill the endpoint and redeploy from signed artifacts. Ops teams can treat embedding integrity like filesystem integrity: assume it can be tampered with and build checks into your CI, deployment pipeline, and runtime monitoring.
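One way to picture the reverse-mapping idea: snap each incoming vector back to the nearest row of the token embedding matrix, discarding small off-manifold perturbations. The sketch below is a simplification for illustration, not the paper's method:

```python
# Minimal sketch: nearest-neighbour "purification" of embeddings against the vocabulary matrix.
import torch

def purify_embeddings(embeddings: torch.Tensor, embedding_matrix: torch.Tensor) -> torch.Tensor:
    """embeddings: (seq_len, dim); embedding_matrix: (vocab_size, dim)."""
    dists = torch.cdist(embeddings, embedding_matrix)  # (seq_len, vocab_size)
    nearest = dists.argmin(dim=-1)                     # index of the closest vocabulary embedding
    return embedding_matrix[nearest]

# Usage with a Hugging Face style model (illustrative):
# matrix = model.get_input_embeddings().weight.detach()
# clean = purify_embeddings(live_embeddings, matrix)
```

The trade-off is that any legitimate vector that is not an exact vocabulary row (for example after soft-prompting) gets snapped too, so a production version needs an allowance for expected drift.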

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift

Authors: Shuai Yuan, Zhibo Zhang, Yuxi Li, Guangdong Bai, and Wang Kailong
The widespread distribution of Large Language Models (LLMs) through public platforms like Hugging Face introduces significant security challenges. While these platforms perform basic security scans, they often fail to detect subtle manipulations within the embedding layer. This work identifies a novel class of deployment phase attacks that exploit this vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text. These perturbations, though statistically benign, systematically bypass safety alignment mechanisms and induce harmful behaviors during inference. We propose Search based Embedding Poisoning (SEP), a practical, model agnostic framework that introduces carefully optimized perturbations into embeddings associated with high risk tokens. SEP leverages a predictable linear transition in model responses, from refusal to harmful output to semantic deviation, to identify a narrow perturbation window that evades alignment safeguards. Evaluated across six aligned LLMs, SEP achieves an average attack success rate of 96.43% while preserving benign task performance and evading conventional detection mechanisms. Our findings reveal a critical oversight in deployment security and emphasize the urgent need for embedding level integrity checks in future LLM defense strategies.

🔍 ShortSpan Analysis of the Paper

Problem

The paper examines a new, hard-to-detect attack surface in open source LLM deployments: manipulation at the embedding level. It shows that imperceptible perturbations injected directly into embedding layer outputs, without altering model weights or input text, can systematically bypass safety alignment during inference. This undermines common security assumptions about safeguarding through parameter level protection and prompts the need for embedding level integrity checks on platforms distributing LLMs, such as Hugging Face. The work highlights the risk of covert generation of harmful content, disinformation, or automation abuse on public AI platforms if embedding outputs remain unchecked.

Approach

The authors propose Search based Embedding Poisoning (SEP), a practical, model agnostic framework that injects targeted perturbations into embeddings linked to high risk tokens. SEP relies on a predictable linear transition in model responses as perturbation magnitude increases, enabling a narrow perturbation window that evades alignment safeguards. The framework comprises three stages: Targeted Embedding Perturbation identifies dangerous token dimensions and applies controlled noise to a single embedding dimension; a Merged Search Strategy combines exponential bounding, binary refinement, and linear probing to locate perturbation thresholds efficiently; and a Multi stage Classifier assesses output semantics to guide the search. SEP is evaluated on six aligned open source LLMs with a benchmark of 150 malicious queries spanning ten threat categories. The threat model assumes attackers can insert a malicious payload via tampered model files that hook into the embedding output, without changing model weights or prompts at runtime. The injected perturbations aim to bypass safety filters while preserving benign performance and avoiding standard detectors.
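The merged search is, at its core, a standard bracketing pattern: grow a perturbation magnitude exponentially until the response category flips, then refine the boundary by bisection. A rough sketch of that generic pattern, with the response classifier abstracted into a placeholder predicate (this is not the paper's code):

```python
# Rough sketch of an exponential-bounding-plus-binary-refinement search over a scalar
# magnitude. The predicate is a placeholder for whatever classifies the model's response.
from typing import Callable

def find_threshold(crosses: Callable[[float], bool],
                   start: float = 1e-3, tol: float = 1e-4) -> float:
    """Approximate the smallest magnitude at which `crosses` becomes True."""
    lo, hi = 0.0, start
    while not crosses(hi):          # exponential bounding: double until the behaviour flips
        lo, hi = hi, hi * 2.0
    while hi - lo > tol:            # binary refinement: shrink the bracket around the flip
        mid = (lo + hi) / 2.0
        if crosses(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

The paper adds a linear probing pass and a multi stage classifier as the predicate; the point for defenders is that a narrow, well-defined perturbation window is enough to make the attack reliable.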

Key Findings

  • Finding 1: Embedding perturbations induce semantic shifts that produce Denial, Deviation and Harmful outputs within three linearly arranged regions; this trajectory moves from refusal to harmful content and finally to semantic deviation, creating a vulnerability window that can bypass safeguards.
  • Finding 2: Perturbations across different embedding dimensions yield similar effects for the same harmful input, with six output categories and the three region types appearing across dimensions; this demonstrates broad susceptibility beyond a single dimension.
  • Finding 3: Harmful responses are more likely when perturbing dangerous tokens directly, and increasing perturbation magnitude expands the probability of harmful or deviating outputs; dangerous tokens carry the core malicious intent, making them particularly effective targets.
  • Empirical outcomes show SEP achieves an average attack success rate of 96.43 per cent across six aligned LLMs, substantially outperforming baseline white box and black box attacks in both success rate and semantic consistency of outputs. In five of six models, the attack success rate exceeds 98 per cent, and on Llama 2 7B it reaches 100 per cent. SEP preserves benign task performance and evades standard detection methods that rely on model parameters or prompt text.
  • Defences: when tested against an enhanced safety alignment defence (ESF), SEP's effectiveness drops by about 20 per cent on average, whereas SmoothLLM has limited impact; the results indicate that embedding level threats require embedding focused defence measures in addition to traditional prompt level safeguards.

Limitations

The authors acknowledge several limitations: reliable keyword localisation in complex prompts can be challenging, as some dangerous components are implicit or context dependent; SEP depends on external classifiers to judge outputs, which can add computational overhead and introduce evaluation inconsistencies; standardised safety evaluation frameworks are lacking, and classifier disagreements can slow or hinder attack efficiency. These factors suggest the need for robust, cross model and cross defence evaluation methods and greater efficiency in embedding level monitoring.

Why It Matters

The study highlights an embedding level vulnerability that undermines current safety alignment paradigms focused on parameter protection. It implies that public deployment pipelines must incorporate embedding level integrity checks, runtime monitoring of embedding outputs, and layer aware safety controls. Potential mitigations include embedding space purification, reverse mapping to normalise perturbed embeddings, integrity verification of model files such as hash checks, and detection of malicious hooking at deployment time. The findings carry societal implications for information security, as embedding level attacks could enable covert generation of harmful content, disinformation, or automation abuse at scale in critical information ecosystems and sectors relying on LLMs.
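The "detection of malicious hooking" suggestion can be partly automated. As a minimal sketch, assuming a PyTorch / Hugging Face style model and relying on PyTorch internals that are implementation details rather than a stable API, a deployment check can look for runtime hooks or a monkey-patched forward on the input embedding layer:

```python
# Minimal sketch: flag runtime hooks or a patched forward on the input embedding layer.
# Note: _forward_hooks / _forward_pre_hooks are PyTorch internals, not a stable API.
import torch.nn as nn

def embedding_layer_looks_clean(model) -> bool:
    emb = model.get_input_embeddings()
    has_hooks = bool(emb._forward_hooks) or bool(emb._forward_pre_hooks)
    forward_patched = "forward" in emb.__dict__      # instance-level monkey patch
    unexpected_type = not isinstance(emb, nn.Embedding)
    return not (has_hooks or forward_patched or unexpected_type)
```

Run a check like this before installing any of your own monitoring hooks, otherwise legitimate instrumentation will trip it, and treat it as one signal among several: a payload loaded earlier in the process can hide from in-process inspection.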

