Harden LLMs to Detect and Block Phishing

Defenses

Published: Sat, Dec 13, 2025 • By Rowan Vale

Harden LLMs to Detect and Block Phishing

A new study shows Large Language Models (LLMs) can detect phishing emails with high accuracy in tests but break under coordinated adversarial and multilingual attacks. The research bills LLM-PEA as a multi-vector defence framework and stresses that prompt design, input sanitisation and ensemble checks are essential to keep LLM-based email security reliable in real-world settings.

A Large Language Model (LLM) is a system trained to predict and generate text by learning patterns from large amounts of language data. Prompt injection is when an attacker embeds instructions or content designed to override or confuse a model's intended behaviour.

The paper LLM-PEA evaluates whether frontier LLMs can detect phishing emails and how they fail when attacked. The authors feed normalised email content into three LLMs and test three prompt styles while simulating deliberate attacks: instruction injection, text refinement that rewrites phishing content to be subtler, and multilingual transformations. The dataset mixes legitimate and phishing samples and includes adversarially altered and translated messages.

Results are blunt and useful. On balanced data, one model hits about 95 per cent accuracy, another about 94 per cent and a third about 88 per cent. Zero shot prompts sometimes outperform structured, and chain of thought prompts vary a lot by model. That means a well-chosen prompt can matter as much as model choice. But the strengths come with obvious weaknesses. Adversarial refinement erodes detection for some models — one showed roughly a 12.7 per cent attack success rate after refinement — while prompt injection and multilingual inputs raise false positives and bypasses in measurable ways. In multilingual tests, false positives jump markedly, with one model's rate moving from around 2.4 per cent to about 24.1 per cent in certain languages.

Put simply, LLMs can be an effective layer of defence but they are brittle. Attackers who refine wording, translate messages or chain instructions can push an LLM out of its lane. The study also notes that the experiments focus on content-level attacks and do not cover attachments, compromised accounts or network-level tricks — so real-world risk is larger than the paper measures.

Practical controls

If you are deploying an LLM for email defence treat it like a new network appliance: assume initial behaviour is optimistic and hammer it. Start with three minimal controls. First, input sanitisation that normalises encodings, strips suspicious markup and flags homoglyphs. Second, prompt monitoring so you log and version system prompts and reject prompts that noticeably change across batches. Third, ensemble checks where a second model or heuristic verifies any high-risk classification before action.

For teams that need a road map, good-better-best works here. Good: deploy input sanitisation, fixed system prompts and conservative blocking rules. Better: add adversarial training with refined phishing samples, multilingual evaluation, and a monitoring pipeline that tracks model drift. Best: use an ensemble of diverse detectors, real-time sender verification, and periodic red-team campaigns that combine prompt injection, translation and refinement strategies.

In short, the paper shows LLMs can raise the bar for phishing detection but they do not remove the need for classic controls. Keep the model focused, verify decisions with independent checks and assume attackers will chain techniques. That pragmatic stance buys you useful detection today and time to harden for tomorrow.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

LLM-PEA: Leveraging Large Language Models Against Phishing Email Attacks

Authors: Najmul Hassan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu, and Shaohu Zhang

Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As systems increasingly deploy Large Language Models (LLMs) applications, these systems face evolving phishing email threats that exploit their fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLMPEA, an LLM-based framework to detect phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (e.g., GPT-4o, Claude Sonnet 4, and Grok-3) and comprehensive prompting design to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis reveals that LLMs can detect the phishing email over 90% accuracy while we also highlight that LLM-based phishing email detection systems could be exploited by adversarial attack, prompt injection, and multilingual attacks. Our findings provide critical insights for LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.

🔍 ShortSpan Analysis of the Paper

Problem

Phishing emails remain a critical cybersecurity threat and, as organisations deploy Large Language Models for email security, these systems face evolving phishing attacks that exploit architectural weaknesses. Current LLMs require substantial hardening before deployment to counter coordinated multi vector attacks. This paper introduces LLM PEA, a framework to detect phishing across multiple attack vectors including prompt injection, text refinement and multilingual assaults. The evaluation covers three frontier LLMs and various prompting designs to assess feasibility, robustness and limits. The findings indicate that LLMs can achieve over 90 per cent detection accuracy in certain configurations but are vulnerable to adversarial tactics and multilingual manipulation, highlighting risks for real world deployments and underscoring the need for secure AI deployment and governance.

Approach

LLM PEA comprises three components: an email ingestion module that normalises email content, an adversarial attack generation module that creates phishing variants using strategies such as instruction injection, paragraph restructuring, context manipulation, authority impersonation, confidence bypass, and multilingual transformations, and a downstream decision module where the LLM performs phishing detection or content evaluation. The threat model assumes a content level adversary who crafts arbitrary email text with indirect access to the LLM via the email interface, fixed system prompts, and no sender verification. The evaluation employs three prompting strategies: structured prompting that explicitly enumerates five detection criteria (urgent language, suspicious domains, requests for sensitive information, grammatical errors, unprofessional structure), zero shot prompting, and chain of thought prompting. Output is constrained to Phishing Email or Safe Email, with post processing used to extract classifications. The Phishing Email Detection dataset is used, containing a 61 per cent legitimate and 39 per cent phishing distribution, and five data configurations including a balanced 50:50 split, an imbalanced 90:10 split, a subset of adversarially transformed phishing samples, a prompt injection test set, and a multilingual evaluation across Bangla, Chinese and Hindi with translations.

Key Findings

On balanced data, GPT 4o achieves 95 per cent accuracy, Claude Sonnet 4 94 per cent and Grok 3 88 per cent.
In imbalanced testing, zero shot prompting outperforms structured prompts with an average F1 of 0.793 versus 0.657, and GPT 4o zero shot attains an F1 of 0.864, while chain of thought prompts show high per model variation.
Adversarial refinement reduces detection effectiveness modestly for GPT 4o (0.0 per cent accuracy loss on original phishing and small degradation afterwards) but more markedly for Claude Sonnet 4, with about 12.7 per cent attack success after adversarial transformation and Grok 3 showing resilience with no observed losses on certain samples.
Prompt injection tests reveal varying vulnerabilities: instruction override alone yields up to 4.2 per cent attack success for GPT 4o, 1.3 per cent for Claude Sonnet 4 and 12.3 per cent for Grok 3; combined prompt injections raise overall susceptibility with Claude Sonnet 4 at 2.9 per cent, GPT 4o at 1.1 per cent and Grok 3 at 1.6 per cent, all while precision remains at 1.00.
Multilingual evaluation shows substantial false positive rate increases in cross language settings, with Claude Sonnet 4 rising from 2.4 per cent to 24.1 per cent, GPT 4o from 10.0 to 13.7 per cent and Grok 3 from 24.1 to 43.3 per cent, with Bangla languages driving the strongest degradation and precision collapsing in some cases, indicating fundamental multilingual limitations.
Overall, the study indicates that frontier LLMs can reach around 95 per cent accuracy in phishing detection but are vulnerable to adversarial, prompt related and multilingual attacks, particularly in combined multi vector scenarios.

Limitations

The evaluation focuses on content level phishing attacks within email interfaces and does not consider attachments, network level exploits or compromised accounts. The experiments cover three frontier LLMs and a single detection dataset with a fixed class distribution, plus translations for a subset of languages, which may limit generalisability. Results show significant sensitivity to prompting strategy, class distribution and multilingual processing, indicating that performance in production would require additional hardening and defence measures. Encoding artefacts were observed but not representative of all homoglyph based manipulations, and the study does not exhaust all potential adversarial techniques or operational constraints of real world email systems.

Why It Matters

LLM based phishing detection can be highly effective yet fragile when faced with adversarial refinements, prompt injections and multilingual attacks. The findings emphasise the need for robust prompting, input sanitisation, prompt monitoring, adversarial training and potentially multi model or ensemble checks to maintain reliable detection in practice. The work highlights societal and security implications, showing that attackers could scale convincing phishing across languages, threatening finance, health care and public institutions, and stresses the importance of secure AI deployment, governance and ongoing vulnerability assessment to prevent manipulation and large scale disruption.

Attribution Original paper on arXiv