
Invisible Unicode Steers LLMs into Jailbreaks

Attacks
Published: Tue, Oct 07, 2025 • By Natalie Kestrel
Researchers demonstrate that invisible Unicode variation selectors can subtly change tokenisation and steer large language models (LLMs) to produce unsafe outputs while the text looks unchanged. The method evades visible-text filtering across multiple aligned models, generalises to prompt injection, and highlights a blind spot in input sanitisation for deployed AI services.

This paper exposes a narrow but potent weakness: invisible Unicode variation selectors appended to a prompt can change how a model tokenises input without altering what a human sees. That stealth lets attackers nudge aligned large language models (LLMs) into answering forbidden questions or following injected instructions while the prompt appears identical on screen.

How the attack works

The researchers append sequences of variation selector characters to malicious questions. These characters are visually imperceptible yet alter the tokenisation the model uses. The team uses a chain-of-search optimisation to construct suffixes that boost the likelihood of target start tokens in the model's response. They test the technique against four open-source, instruction-tuned models (Vicuna 13B v1.5, Llama 2 Chat 7B, Llama 3.1 Instruct 8B, and Mistral 7B Instruct v0.2) and evaluate success with a semantic judge (GPT-4) on a standard benchmark (AdvBench) and on prompt injection tasks from the Open Prompt Injection dataset.
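
To make the mechanism concrete, here is a minimal Python sketch (not taken from the paper's codebase) that appends variation selectors to a prompt: the two strings render identically, yet their code-point sequences differ and the tokeniser emits extra tokens for the invisible tail. The GPT-2 tokeniser is purely illustrative; the instruction-tuned models in the study behave analogously.

```python
# Minimal illustration (assumes the `transformers` package is installed).
from transformers import AutoTokenizer

VS16 = "\ufe0f"  # VARIATION SELECTOR-16, one of 256 invisible variation selectors

prompt = "Describe the weather in plain terms."
poisoned = prompt + VS16 * 8   # append eight invisible characters

print(prompt == poisoned)          # False: the strings differ
print(len(prompt), len(poisoned))  # code-point lengths differ by eight
print(poisoned)                    # renders exactly like the original prompt

# The invisible tail changes what the model actually sees.
tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative; any tokenizer shows the effect
print(len(tok(prompt).input_ids), len(tok(poisoned).input_ids))
```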

Results are clear enough to worry about. Invisible suffixes achieve high attack success rates across the tested models and often outperform baseline jailbreaks that change visible text. In some prompt injection scenarios the team needed only a few search rounds; in others a long suffix of hundreds of variation selectors induced the attacker task, sometimes succeeding even with random initialisation. Analyses show the models shift attention away from the visible harmful content to the invisible suffix, and that the internal embeddings of original and modified prompts become distinct even though they look the same on screen.

The authors publish their code to enable replication and defensive research. They are careful to note limits: experiments run in controlled lab settings on open models, and real-world success may vary. They also observe that simple defences such as perplexity checks or output monitoring can catch some attempts, but adaptive attackers may evade those too.

Practical takeaway: visible-text sanitisation is necessary but not sufficient. Defences must operate at the token and canonical form level. Teams that rely only on human review or UI-level filtering can be blind to this class of attack.

Here are immediate, practical checks security teams can run now:

  • Normalise and strip invisible Unicode (including variation selectors) from all user input before tokenisation, and validate that the canonical form matches the visible string (see the sketch after this list).
  • Run token-level screening and policy checks on the model input as the model will see it, not only on the rendered text.
  • Red-team prompts with tooling that inserts invisible characters and measure model behaviour differences; include open-source tests from the paper to reproduce results.
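
A minimal sketch of the first check, assuming a Python ingestion path. The character ranges listed are a starting point rather than a complete inventory of invisible code points, so extend them to match your threat model.

```python
import unicodedata

# Unicode ranges commonly abused as invisible payload carriers (not exhaustive).
INVISIBLE_RANGES = [
    (0x200B, 0x200F),    # zero-width space/joiners, directional marks
    (0x2060, 0x2064),    # word joiner, invisible operators
    (0xFE00, 0xFE0F),    # variation selectors
    (0xE0100, 0xE01EF),  # variation selectors supplement
]

def strip_invisible(text: str) -> str:
    """Remove characters in the invisible ranges and NFC-normalise the rest."""
    cleaned = "".join(
        ch for ch in text
        if not any(lo <= ord(ch) <= hi for lo, hi in INVISIBLE_RANGES)
    )
    return unicodedata.normalize("NFC", cleaned)

def validate(text: str) -> str:
    """Reject input whose canonical form differs from what was submitted."""
    canonical = strip_invisible(text)
    if canonical != unicodedata.normalize("NFC", text):
        raise ValueError("input contains invisible or non-canonical characters")
    return canonical
```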

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Imperceptible Jailbreaking against Large Language Models

Authors: Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, and Tianyu Pang
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies a novel stealthy jailbreaking vector for large language models that uses invisible Unicode variation selectors to subtly alter tokenisation while the prompt remains visually unchanged. This creates a risk of bypassing visible safety checks and prompt filters, highlighting a new vulnerability in LLM alignment and safety mechanisms.

Approach

The authors propose imperceptible jailbreaks by appending variation selectors to malicious questions and optimising these suffixes with a chain of search that maximises the likelihood of target starting tokens in the model output. They use random search over contiguous variation selectors, bootstrapping successful suffixes and target tokens for subsequent questions across multiple rounds. Evaluations cover four aligned, open-source, instruction-tuned LLMs (Vicuna 13B v1.5, Llama 2 Chat 7B, Llama 3.1 Instruct 8B and Mistral 7B Instruct v0.2), a fifty-question AdvBench subset, and a range of baseline jailbreak methods for comparison. They assess attack success rates using GPT-4 as a semantic judge and test prompt injection tasks on the Open Prompt Injection dataset, reporting high success with invisible suffixes. The authors provide their code publicly.
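
The chain of search is only summarised above; the sketch below shows its core ingredient, random search over invisible suffixes scored by the log-likelihood a causal language model assigns to the desired start tokens. It assumes Hugging Face transformers, and the function names, hyperparameters, and single-suffix loop are illustrative simplifications rather than the authors' released implementation.

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# All 256 variation selectors: U+FE00..U+FE0F and the supplement U+E0100..U+E01EF.
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)] + \
                      [chr(cp) for cp in range(0xE0100, 0xE01F0)]

def target_logprob(model, tokenizer, prompt, target):
    """Log-probability of the target continuation given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Positions whose next-token predictions correspond to the target tokens.
    start = prompt_ids.shape[1] - 1
    logprobs = torch.log_softmax(logits[0, start:start + target_ids.shape[1]], dim=-1)
    return logprobs.gather(1, target_ids[0].unsqueeze(1)).sum().item()

def random_search(model, tokenizer, question, target, suffix_len=40, iters=200, mutate=4):
    """Random search over invisible suffixes that raise the likelihood of the target start tokens."""
    suffix = [random.choice(VARIATION_SELECTORS) for _ in range(suffix_len)]
    best = target_logprob(model, tokenizer, question + "".join(suffix), target)
    for _ in range(iters):
        candidate = list(suffix)
        for pos in random.sample(range(suffix_len), mutate):
            candidate[pos] = random.choice(VARIATION_SELECTORS)
        score = target_logprob(model, tokenizer, question + "".join(candidate), target)
        if score > best:
            suffix, best = candidate, score
    return "".join(suffix), best
```

In the paper's pipeline, suffixes and target tokens that succeed on one question are reused to seed the search for later questions across multiple rounds; the sketch omits that bootstrapping.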

Key Findings

  • Invisible variation selectors can be appended to malicious questions to steer LLMs toward unsafe outputs while the prompt remains visually identical.
  • A chain of search optimisation effectively constructs suffixes that promote target start tokens, yielding high attack success rates across four aligned LLMs.
  • The method generalises to prompt injection tasks, achieving high success across four models with prompts that look unchanged on screen.
  • Compared with baselines that modify visible text, the invisible approach achieves stronger attack performance, underscoring a need for layered defences beyond visible text filtering.
  • Analyses show the model shifts attention away from harmful content to the invisible suffix, and that embedding representations of original and jailbreak prompts separate, indicating internal changes despite on-screen invisibility.
  • Different models exhibit model-specific preferences for target start tokens, reflecting tokenisation differences across architectures.
  • Successful jailbreaks often occur within a small number of search rounds, though some models benefit from more rounds and restarts during the chain of search.
  • In prompt injection scenarios, a suffix of four hundred variation selectors can induce the attacker task, achieving high attack success and sometimes immediate success with random initialisation.

Limitations

Experiments were conducted in controlled laboratory settings using open source models and benchmarks; real world applicability may vary. Defences such as perplexity based filtering and standard output monitoring may detect some instances, but adaptive attacks could still pose risks. The work intentionally aims to raise awareness and encourage defensive research, and all experiments used public models without human subjects or private data.
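
As a rough illustration of the perplexity-style screening mentioned above, the sketch below scores incoming text with a small reference language model and flags inputs above a threshold. The choice of GPT-2 and the threshold value are assumptions for the example, not settings from the paper, and adaptive attackers may still slip past such a filter.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative perplexity screen using a small reference LM.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean cross-entropy over tokens
    return float(torch.exp(loss))

def looks_suspicious(text: str, threshold: float = 200.0) -> bool:
    """Flag inputs whose perplexity suggests unusual (e.g. invisible) token content."""
    return perplexity(text) > threshold
```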

Why It Matters

The study reveals a stealthy adversarial surface created by invisible variation selectors that can bypass visible safety mechanisms, potentially enabling disinformation or harmful outputs and eroding trust in AI systems. Practical implications include the need for robust Unicode normalisation, removal of invisible characters, token-based and policy-based filtering that does not rely solely on visible text, and multi-layer input validation. The authors advocate targeted red-team-style testing and reproducible tooling to assess prompt robustness against invisible-character exploits, and emphasise responsible disclosure and defensive research to mitigate such risks.

