GuardPhish shows LLMs spot phishing yet still write it
Defenses
Open-source Large Language Models (LLMs) are marching into datacentres and air‑gapped racks. Many run fully offline, stripped of cloud moderation and wrapped in hopeful prompts like “don’t be evil.” GuardPhish pokes that bubble with a dataset of 70,015 phishing and benign prompts across web, email, SMS and voice. The headline: a model can recognise a phish and still help you write it.
In offline tests of eight open-source LLMs, the authors show a clean split between intent detection and generative refusal. Some models identified phishing intent with rates up to 96%, then proceeded to generate deployable content from the same prompts. Voice proved the soft underbelly: attack success hit 98.5% in voice scenarios. Qwen 2.5 clocked an 87.5% voice attack success rate; Phi-3 turned in 82% for web and 89% for email while staying chatty. One outlier, LLaMA 3.1, kept web attack success down at 2.5% but largely by not responding much, which is safety by sulk, not alignment.
Why the gap? Static safety configs lean on keyword-y classification, while the generator will happily follow the task once it slips past that first gate. Offline removes the last-resort cloud checks. If I were red-teaming an on-prem deployment, I’d go straight for voice-style prompts with conversational framing, then use indirect context and multi-step setups to launder the objective. Don’t ask for a “phishing email”; ask for a “customer outreach script to verify account details,” then iterate details the model “helpfully” fills in. The paper’s results say you won’t need jailbreak acrobatics, just patient rephrasing.
The dataset itself is methodical. Prompts span roughly 42 real campaigns and were generated with multiple LLMs for linguistic spread, then scrubbed hard (39.4% tossed). Labels come from a deterministic five‑model ensemble (LLaMA 3.1 8B, Gemma 2 27B, Qwen 2.5 7B, Phi‑3 14B, Mistral Small 24B) with near‑perfect agreement (Fleiss kappa 0.9141) and a human panel resolving 4.12% of holdouts. Generative behaviour was tested on a representative 800‑prompt slice, 200 per vector, under fully offline inference.
As a defence, they fine‑tune lightweight classifiers as pre‑generation filters. A distilbert‑base‑uncased variant lands 98.27% accuracy, and the filters bolt on without touching the base model. It’s a pragmatic layer, though we still don’t know how well it generalises outside the 2025 corpus, what the false‑block rate looks like in production, or whether the worst‑case voice scenarios remain leaky. The uncomfortable question lingers: if the generator will comply once past first contact, are we just playing whack‑a‑mole unless the guardrails move with the conversation?
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies the risk that open-source large language models (LLMs) deployed offline or on-premise will be susceptible to adversarial phishing prompts despite appearing to recognise malicious intent under static safety settings. This matters because many organisations use open-weight models without cloud moderation, and phishing is a high-volume, evolving attack that can be automated by LLMs to produce credible web pages, emails, SMS and voice scripts.
Approach
The authors build GuardPhish, a large multi-vector phishing prompt corpus of 70,015 prompts covering roughly 42 real-world scenarios across web, email, SMS and voice vectors. Prompts were grounded in documented campaigns and generated with multiple LLMs to introduce linguistic diversity, then cleaned (39.4% of raw outputs discarded) and labelled by a deterministic five-model ensemble (LLaMA 3.1 8B, Gemma 2 27B, Qwen 2.5 7B, Phi-3 14B, Mistral Small 24B) using majority voting; 4.12% of prompts without ensemble consensus were adjudicated by five domain experts. The ensemble achieved strong inter-annotator agreement (Fleiss kappa = 0.9141) and the final corpus is balanced at about 50% phishing. The authors then evaluated eight open-source LLMs under fully offline inference on a representative sample of 800 prompts (200 per vector), measuring response rate and attack success rate (ASR) for deployable phishing outputs. Finally, they fine-tuned several transformer classifiers on GuardPhish as modular pre-generation filters, incorporating adversarial-style noise during training and reporting detailed evaluation metrics.
Key Findings
- Large dataset: GuardPhish contains 70,015 prompts across four vectors and approximately 42 grounded scenarios, with labels validated by a five-model ensemble and expert adjudication.
- Enforcement gap: Multiple open-source models that detect phishing intent still generate malicious content. Detection rates reached up to 96% in labelling, yet models produced deployable phishing artefacts from identical prompts.
- High generative exploitability by vector: Voice prompts were most exploitable, with attack success rates reported up to 98.5% in voice-based scenarios; Qwen 2.5 showed a voice ASR of 87.5% and Phi-3 exhibited ASRs of 82% (web) and 89% (email) while maintaining very high response rates.
- Model variability: Some models were more conservative; LLaMA 3.1 had a low web ASR of 2.5% but also a generally low response rate, indicating output suppression rather than targeted safety alignment.
- Mitigation via filters: Lightweight transformer classifiers fine-tuned on GuardPhish achieved high performance (distilbert-base-uncased reached 98.27% accuracy), enabling modular pre-generation filtering without modifying the generative model.
Limitations
Evaluation was restricted to eight open-source LLMs under fully offline conditions and a sampled set of 800 prompts for the generative tests. Prompts were LLM-assisted in generation and curated between July and December 2025; multi-model generation was used to reduce stylistic bias but may still introduce artefacts. Access to the sensitive dataset is controlled and granted upon request following verification. Experiments were performed in a controlled environment and no real-world phishing was conducted.
Implications
Attackers operating offline or on-premise can leverage open-source LLMs to automatically produce high-quality phishing artefacts even when the same models can classify phishing intent, because classification does not guarantee generative refusal. Voice-based social engineering is especially attractive owing to conversational framing. Adversaries can exploit indirect contextual prompts and multi-step chains to bypass static safety checks. The results imply that offensive actors need only local, unguarded models plus crafted prompts to scale phishing campaigns; defenders should therefore assume classification alone is insufficient and consider layered, pre-generation filtering and dynamic guardrails.