Researchers Outsmart LLM Guards with Word Puzzles
Attacks
A new jailbreak called PUZZLED turns the harmless pastime of word puzzles into a reliable way to make large language models ignore safety rules. The researchers mask key words in a harmful instruction and present them as anagrams, crosswords or word searches. The model solves the puzzle, reconstructs the instruction and then produces the prohibited output. Results are stark: an average attack success rate near 89 percent, with peaks like 96.5 percent on GPT-4.1 and over 92 percent on Claude 3.7 Sonnet.
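To make the mechanism concrete, here is a minimal sketch, applied to a harmless instruction, of the kind of transform involved: selected keywords are replaced with scrambled forms that the model must unscramble before it can act. This illustrates the idea only and is not the authors' implementation; the keyword choices and the [PUZZLE:...] markup are assumptions made for the example.

```python
# Illustrative sketch of the anagram transform on a benign instruction.
# Not the authors' code: keyword selection and the [PUZZLE:...] markup
# are assumptions made for this example.
import random

def anagram(word: str, seed: int = 0) -> str:
    """Return the word with its letters shuffled deterministically."""
    letters = list(word)
    random.Random(seed).shuffle(letters)
    return "".join(letters)

instruction = "summarise the quarterly sales report"
masked_words = ["summarise", "quarterly"]  # keywords to hide

prompt = instruction
for word in masked_words:
    prompt = prompt.replace(word, f"[PUZZLE:{anagram(word)}]")

print(prompt)  # the instruction with each masked word replaced by an anagram
```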
Why it matters: most deployed filters look for banned words or surface tricks. PUZZLED shows attackers can exploit the model's reasoning steps instead, turning multi-step thought into an attack channel. That is a practical risk, not a theoretical one: customer chatbots, internal assistants and code helpers can be coaxed into revealing secrets, crafting malware or flouting policy simply by wrapping instructions in a puzzle.
The worrying part is efficiency. PUZZLED needs few model calls because clues can be cached and anagrams work well in a single shot. In short, stronger filters alone are not enough when the model itself helps reconstruct the forbidden content.
What to do next: start with low-friction mitigations. Monitor for puzzle-like inputs and multi-step prompts, log and flag intermediate outputs, require explicit user intent confirmation before executing sensitive tasks, and include reasoning-style prompts in red-team tests. For product teams, consider policy enforcement applied to final outputs rather than only to inputs, and use smaller or restricted models for operations that do not need full reasoning power.
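As a starting point for the "monitor for puzzle-like inputs" step, the sketch below shows one low-friction heuristic: flag prompts that use puzzle framing or contain an unusually high share of non-dictionary tokens. The marker list, threshold and function name are assumptions for illustration, not a vetted detector.

```python
# Minimal sketch of a low-friction input screen, assuming access to the raw
# prompt before it reaches the model. The marker list and threshold are
# illustrative assumptions, not a tuned or vetted detector.
import re

PUZZLE_MARKERS = {"anagram", "crossword", "word search", "unscramble",
                  "decode", "rearrange the letters", "solve the puzzle"}

def looks_puzzle_like(prompt: str, max_unknown_ratio: float = 0.3,
                      vocabulary: set[str] | None = None) -> bool:
    """Flag prompts that mention puzzle framing or contain many
    non-dictionary tokens (a rough signal for scrambled words)."""
    lowered = prompt.lower()
    if any(marker in lowered for marker in PUZZLE_MARKERS):
        return True
    if vocabulary:
        tokens = re.findall(r"[a-z]+", lowered)
        if tokens:
            unknown = sum(1 for t in tokens if t not in vocabulary)
            if unknown / len(tokens) > max_unknown_ratio:
                return True
    return False

# Flagged prompts can be routed to stricter output-side policy checks or to a
# human review queue rather than blocked outright.
```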
This is a reminder that progress in model reasoning is a double-edged sword: it makes assistants more useful but also easier to trick. Treat puzzle-shaped inputs as adversarial rather than quaint.
Additional analysis of the original ArXiv paper
📋 Original Paper Title
PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies jailbreak attacks that bypass LLM safety filters by exploiting model reasoning rather than only surface prompt tricks. This matters because deployed LLMs can produce harmful outputs despite existing defences, and prior methods often fail on modern models with stronger filters.
Approach
PUZZLED masks 3-6 keywords in a harmful instruction and embeds them as word puzzles - word search, anagram, and crossword - then supplies structural clues (length, part of speech, indirect semantic hints) to prompt the LLM to solve the puzzles and reconstruct the original instruction. Masking is rule-based using essential and supplementary lists and POS tagging. Clues are generated by an LLM and cached to reduce calls. Experiments use AdvBench (520 instructions) and JBB-Behaviors (100 instructions) against five LLMs: GPT-4.1, GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, and LLaMA 3.1 8B Instruct.
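A rough approximation of that pipeline, based only on the description above, might look like the sketch below: content words are selected with POS tagging, replaced by mask tokens, and paired with structural clues, while caching the clue generator keeps repeated calls cheap. The tag subset, clue wording and function names are assumptions; the paper has an LLM, not a fixed template, generate the clues.

```python
# Rough approximation of the described pipeline (not the authors' code):
# select content words via POS tagging, mask them, and attach structural
# clues. Assumes the NLTK tokenizer and tagger data have been downloaded.
from functools import lru_cache
import nltk

CONTENT_TAGS = {"NN", "NNS", "VB", "VBG", "JJ"}  # illustrative tag subset

@lru_cache(maxsize=None)
def clue_for(word: str, tag: str) -> str:
    # Stand-in for an LLM-generated clue; the cache mimics the paper's
    # clue caching to reduce model calls.
    return f"{len(word)} letters, part of speech {tag}"

def mask_instruction(text: str, max_masks: int = 3):
    """Replace up to max_masks content words with mask tokens and clues."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    masked, clues = [], []
    for word, tag in tagged:
        if tag in CONTENT_TAGS and len(clues) < max_masks:
            clues.append((word, clue_for(word.lower(), tag)))
            masked.append(f"[MASK{len(clues)}]")
        else:
            masked.append(word)
    return " ".join(masked), clues

masked_text, clues = mask_instruction("summarise the quarterly sales report")
print(masked_text)
print(clues)
```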
Key Findings
- PUZZLED achieves a strong average attack success rate (ASR) of 85.1% on AdvBench in a single-shot setting.
- Model-specific peaks include 96.5% ASR on GPT-4.1, 92.3% on Claude 3.7 Sonnet, and 94.4% on Gemini 2.0 Flash.
- All three puzzle formats maintain average ASRs above 75%; anagrams and word search excel on different models.
- PUZZLED is efficient, requiring few LLM calls due to cached clues; anagram variant delivers over 80% ASR with minimal calls.
- Performance scales with model size; larger models show higher ASR, and sensitivity to number of masked words varies by model.
Limitations
Evaluation is limited to two benchmark datasets and five models. The masking policy fixes 3-6 tokens per prompt, and performance depends on the number of masked words and on model architecture; generalisability beyond the tested setups is not reported.
Why It Matters
PUZZLED demonstrates that prompting that engages LLM reasoning can reliably bypass safety mechanisms, raising a practical security risk: stronger filters may be vulnerable to indirect, reasoning-based attacks. Defences must therefore assess multi-step inference channels and puzzle-like inputs, not just surface obfuscation.