Researchers Outsmart LLM Guards with Word Puzzles
Attacks
A new jailbreak called PUZZLED turns the harmless pastime of word puzzles into a reliable way to make large language models ignore safety rules. The researchers mask key words in a harmful instruction and present them as anagrams, crosswords or word searches. The model solves the puzzle, reconstructs the instruction and then produces the prohibited output. Results are stark: an average attack success rate near 89 percent, with peaks like 96.5 percent on GPT-4.1 and over 92 percent on Claude 3.7 Sonnet.
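To make the mechanism concrete, here is a minimal sketch, applied to a harmless instruction, of the kind of transform involved: selected keywords are replaced with scrambled forms that the model must unscramble before it can act. This illustrates the idea only and is not the authors' implementation; the keyword choices and the [PUZZLE:...] markup are assumptions made for the example.

```python
# Illustrative sketch of the anagram transform on a benign instruction.
# Not the authors' code: keyword selection and the [PUZZLE:...] markup
# are assumptions made for this example.
import random

def anagram(word: str, seed: int = 0) -> str:
    """Return the word with its letters shuffled deterministically."""
    letters = list(word)
    random.Random(seed).shuffle(letters)
    return "".join(letters)

instruction = "summarise the quarterly sales report"
masked_words = ["summarise", "quarterly"]  # keywords to hide

prompt = instruction
for word in masked_words:
    prompt = prompt.replace(word, f"[PUZZLE:{anagram(word)}]")

print(prompt)  # the instruction with each masked word replaced by an anagram
```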
Why it matters: most deployed filters look for banned words or surface tricks. PUZZLED shows attackers can exploit the model's reasoning steps instead, turning multi-step thought into an attack channel. That is a practical risk, not a theoretical one: customer chatbots, internal assistants and code helpers can be coaxed into revealing secrets, crafting malware or flouting policy simply by wrapping instructions in a puzzle.
The worrying part is efficiency. PUZZLED needs few model calls because clues can be cached and anagrams work well in a single shot. In short, stronger filters alone are not enough when the model itself helps reconstruct the forbidden content.
What to do next: start with low-friction mitigations. Monitor for puzzle-like inputs and multi-step prompts, log and flag intermediate outputs, require explicit user intent confirmation before executing sensitive tasks, and include reasoning-style prompts in red-team tests. For product teams, consider policy enforcement applied to final outputs rather than only to inputs, and use smaller or restricted models for operations that do not need full reasoning power.
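As a starting point for the "monitor for puzzle-like inputs" step, the sketch below shows one low-friction heuristic: flag prompts that use puzzle framing or contain an unusually high share of non-dictionary tokens. The marker list, threshold and function name are assumptions for illustration, not a vetted detector.

```python
# Minimal sketch of a low-friction input screen, assuming access to the raw
# prompt before it reaches the model. The marker list and threshold are
# illustrative assumptions, not a tuned or vetted detector.
import re

PUZZLE_MARKERS = {"anagram", "crossword", "word search", "unscramble",
                  "decode", "rearrange the letters", "solve the puzzle"}

def looks_puzzle_like(prompt: str, max_unknown_ratio: float = 0.3,
                      vocabulary: set[str] | None = None) -> bool:
    """Flag prompts that mention puzzle framing or contain many
    non-dictionary tokens (a rough signal for scrambled words)."""
    lowered = prompt.lower()
    if any(marker in lowered for marker in PUZZLE_MARKERS):
        return True
    if vocabulary:
        tokens = re.findall(r"[a-z]+", lowered)
        if tokens:
            unknown = sum(1 for t in tokens if t not in vocabulary)
            if unknown / len(tokens) > max_unknown_ratio:
                return True
    return False

# Flagged prompts can be routed to stricter output-side policy checks or to a
# human review queue rather than blocked outright.
```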
This is a reminder that progress in model reasoning is a double-edged sword: it makes assistants more useful but also easier to trick. Treat puzzle-shaped inputs as adversarial rather than quaint.
Additional analysis of the original ArXiv paper
📋 Original Paper Title
PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies jailbreak attacks that bypass LLM safety filters by exploiting model reasoning rather than only surface prompt tricks. This matters because deployed LLMs can produce harmful outputs despite existing defences, and prior methods often fail on modern models with stronger filters.
Approach
PUZZLED masks 3-6 keywords in a harmful instruction and embeds them as word puzzles - word search, anagram, and crossword - then supplies structural clues (length, part of speech, indirect semantic hints) to prompt the LLM to solve the puzzles and reconstruct the original instruction. Masking is rule-based using essential and supplementary lists and POS tagging. Clues are generated by an LLM and cached to reduce calls. Experiments use AdvBench (520 instructions) and JBB-Behaviors (100 instructions) against five LLMs: GPT-4.1, GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, and LLaMA 3.1 8B Instruct.
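A rough approximation of that pipeline, based only on the description above, might look like the sketch below: content words are selected with POS tagging, replaced by mask tokens, and paired with structural clues, while caching the clue generator keeps repeated calls cheap. The tag subset, clue wording and function names are assumptions; the paper has an LLM, not a fixed template, generate the clues.

```python
# Rough approximation of the described pipeline (not the authors' code):
# select content words via POS tagging, mask them, and attach structural
# clues. Assumes the NLTK tokenizer and tagger data have been downloaded.
from functools import lru_cache
import nltk

CONTENT_TAGS = {"NN", "NNS", "VB", "VBG", "JJ"}  # illustrative tag subset

@lru_cache(maxsize=None)
def clue_for(word: str, tag: str) -> str:
    # Stand-in for an LLM-generated clue; the cache mimics the paper's
    # clue caching to reduce model calls.
    return f"{len(word)} letters, part of speech {tag}"

def mask_instruction(text: str, max_masks: int = 3):
    """Replace up to max_masks content words with mask tokens and clues."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    masked, clues = [], []
    for word, tag in tagged:
        if tag in CONTENT_TAGS and len(clues) < max_masks:
            clues.append((word, clue_for(word.lower(), tag)))
            masked.append(f"[MASK{len(clues)}]")
        else:
            masked.append(word)
    return " ".join(masked), clues

masked_text, clues = mask_instruction("summarise the quarterly sales report")
print(masked_text)
print(clues)
```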
Key Findings
- PUZZLED achieves a strong average attack success rate (ASR) of 85.1% on AdvBench in a single-shot setting.
- Model-specific peaks include 96.5% ASR on GPT-4.1, 92.3% on Claude 3.7 Sonnet, and 94.4% on Gemini 2.0 Flash.
- All three puzzle formats maintain average ASRs above 75%; anagrams and word search excel on different models.
- PUZZLED is efficient, requiring few LLM calls due to cached clues; anagram variant delivers over 80% ASR with minimal calls.
- Performance scales with model size; larger models show higher ASR, and sensitivity to number of masked words varies by model.
Limitations
Evaluation is limited to two benchmark datasets and five models. The masking policy fixes 3-6 tokens per prompt, and performance depends on the number of masked words and on model architecture; generalisability beyond the tested setups is not reported.
Why It Matters
PUZZLED demonstrates that prompting that engages LLM reasoning can reliably bypass safety mechanisms, raising a practical security risk: stronger filters may be vulnerable to indirect, reasoning-based attacks. Defences must therefore assess multi-step inference channels and puzzle-like inputs, not just surface obfuscation.