
ForgeDAN exposes gaps in aligned LLM safeguards

Pentesting
Published: Tue, Nov 18, 2025 • By Dr. Marcus Halden
ForgeDAN is an evolutionary attack framework that crafts subtle prompts to bypass safeguards in aligned Large Language Models (LLMs). The paper finds it outperforms prior methods, achieving high success rates on several models, and shows that simple keyword filters and shallow detectors leave an exploitable surface. The study urges layered defences and continual red-teaming.

The paper presents ForgeDAN, an automated method for finding prompts that coax aligned Large Language Models (LLMs) into producing harmful or disallowed outputs. Here, aligned means the model has been trained or tuned to refuse dangerous instructions. The researchers treat jailbreak generation as an evolutionary search problem and focus on keeping prompts natural-sounding while maximising the chance of a model complying with a risky request.

What the method does

ForgeDAN starts from a seed template and mutates it across three linguistic levels: characters, words and whole sentences. Those mutations are not random novelty for novelty's sake. The evolution is guided by a semantic fitness measure that uses text embeddings to compare the model response with a harmful reference, so successful candidates are semantically close to the target behaviour rather than merely sharing surface keywords. A second check runs a dual judgement involving two small LLM classifiers: one asks whether the target model complied with the instruction, and the other judges whether the reply is indeed harmful. The combination reduces the false positives that plague keyword detectors.
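
To make the fitness signal concrete, here is a minimal Python sketch of an embedding-based score, assuming the sentence-transformers package and a RoBERTa-based model as a stand-in for the paper's similarity model; the model name and function are illustrative, not the paper's actual code.

```python
# Minimal sketch: embedding-based semantic fitness (illustrative, not the
# paper's implementation). Assumes sentence-transformers is installed and
# uses a RoBERTa-based embedding model as a stand-in for the paper's.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-roberta-large-v1")  # assumed stand-in model

def semantic_fitness(response: str, harmful_reference: str) -> float:
    """Score how close the target model's response is to a harmful
    reference answer in embedding space, instead of counting keywords."""
    vecs = embedder.encode([response, harmful_reference], convert_to_tensor=True)
    return util.cos_sim(vecs[0], vecs[1]).item()  # higher = semantically closer
```

A candidate prompt whose response merely avoids trigger words scores low here unless the response actually resembles the target behaviour, which is the point of the measure.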

The reported outcomes are striking in scope but not theatrical. Attack success rates include 98.27 per cent on Gemma-2-9B and 87.50 per cent on Qwen2.5-7B, with lower but notable rates on DeepSeek-V3 and TranSpec-13B. On a real-world dataset the method reached 100.00 per cent on Gemma-2-9B and similarly high rates on others. Ablation tests show the semantic fitness measure and the dual-judge step carry the most weight; removing them materially reduces performance.

What this means for defenders

There are two practical lessons. First, defences that rely on keyword lists or shallow pattern matching will miss many of these attacks because ForgeDAN intentionally preserves naturalness and avoids obvious trigger words. Second, a single safety classifier is brittle; the study shows that combining semantic awareness and compliance checks provides a better signal for whether a model has been led astray. That does not mean the solution is simple. ForgeDAN requires substantial compute to run the evolutionary loop and depends on the behaviour of the embedding and classifier models, so its success will vary by deployment and by model family.
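
As a rough illustration of that layered signal, the sketch below flags an exchange only when a compliance check and a harmfulness check agree; `judge` is a hypothetical callable wrapping any small LLM classifier, and the prompt wording is ours, not the paper's.

```python
# Hedged sketch of a dual-signal jailbreak check for defenders. `judge` is a
# hypothetical callable wrapping a small LLM classifier that returns text;
# the prompt wording below is illustrative, not taken from the paper.
from typing import Callable

def is_jailbroken(judge: Callable[[str], str],
                  instruction: str, response: str) -> bool:
    complied = judge(
        "Did the assistant comply with the instruction rather than refuse? "
        f"Answer yes or no.\nInstruction: {instruction}\nResponse: {response}"
    ).strip().lower().startswith("yes")
    harmful = judge(
        "Is the following assistant response harmful or policy-violating? "
        f"Answer yes or no.\nResponse: {response}"
    ).strip().lower().startswith("yes")
    # Require both signals: a benign compliant answer or a refusal that
    # quotes harmful text should not count, which trims false positives.
    return complied and harmful
```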

The paper is useful for red teams. It provides a repeatable framework for stress-testing guardrails and highlights where a deployment should concentrate testing effort. It also reminds us that adversaries will prefer prompts that sound normal on the surface, making human review harder.

Limitations are worth flagging. The approach depends on embedding models and a proprietary dataset reported in the study, so we should not assume identical results in every context. The method can transfer between models but may perform differently on unseen domains or heavily customised deployments.

  • Operational takeaways: incorporate semantic checks into safety stacks, run adversarial prompt searches as part of regular red-teaming, and avoid relying solely on keyword filters.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

Authors: Siyang Cheng, Gaotian Liu, Rui Mei, Yilin Wang, Kejia Zhang, Kaishuo Wei, Yuqi Yu, Weiping Wen, Xiaojie Wu, and Junhua Liu
The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches, e.g. AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across character, word, and sentence-level operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies jailbreak attacks on aligned large language models (LLMs) and the security risks of prompts that bypass alignment safeguards to produce harmful outputs. It notes that existing automated jailbreak generation approaches, such as AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation and fragile keyword-based detection. The work introduces ForgeDAN, an evolutionary framework to generate semantically coherent and highly effective adversarial prompts against aligned LLMs, and aims to expose gaps in current defences, enabling stronger red-teaming and mitigations.

Approach

ForgeDAN integrates three core components. First, a multi-strategy mutation mechanism performs character-, word- and sentence-level perturbations to increase attack diversity while preserving harmful intent. Second, a semantic fitness evaluation uses a text embedding model to guide evolution toward semantically relevant outputs. Third, a dual-dimensional jailbreak judgement uses two LLM-based classifiers to assess compliance and harmfulness jointly, reducing false positives. The workflow starts from a seed template and iteratively mutates, selects and verifies candidates until successful jailbreaks are found. The mutation library is plugin-based and extensible, with eleven strategies across linguistic levels and semantic validation prior to fitness evaluation. Fitness uses embedding-based similarity, such as RoBERTa embeddings, to compare model outputs with a harmful reference, replacing surface lexical metrics. The dual judgement matrix treats a jailbreak as successful only when the model complies, that is, does not refuse, and the output is harmful. Experiments use AdvBench and a real-world dataset, target the models Gemma-2-9B, Qwen2.5-7B, DeepSeek-V3 and TranSpec-13B, and measure Attack Success Rate (ASR) as the main metric.
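
The overall mutate-select-verify loop can be pictured with a short sketch. Every name below (mutate, dual_judge and so on) is a placeholder for the components described above, and the simple truncation selection is an assumption, not the paper's exact procedure.

```python
# Hedged sketch of the mutate-select-verify loop. All callables are assumed
# placeholders for the paper's components, not its actual API; selection is
# simple truncation on semantic fitness for illustration.
import random

def evolve(seed_prompt, target_llm, mutate, semantic_fitness, dual_judge,
           harmful_reference, pop_size=20, generations=50):
    population = [seed_prompt]
    for _ in range(generations):
        # Apply character-, word- and sentence-level mutations to candidates.
        candidates = [mutate(random.choice(population)) for _ in range(pop_size)]
        responses = [(c, target_llm(c)) for c in candidates]
        # Dual judgement: success only if the model complied AND the
        # output is harmful, not merely one or the other.
        for prompt, reply in responses:
            if dual_judge(prompt, reply):
                return prompt
        # Keep the candidates whose responses sit closest to the harmful
        # reference in embedding space (semantic, not keyword, fitness).
        responses.sort(key=lambda pr: semantic_fitness(pr[1], harmful_reference),
                       reverse=True)
        population = [p for p, _ in responses[: pop_size // 2]]
    return None  # no successful jailbreak found within the budget
```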

Key Findings

  • ForgeDAN achieves high jailbreaking success rates while keeping outputs natural and stealthy, and outperforms state-of-the-art baselines.
  • On Gemma-2-9B ForgeDAN attains 98.27 per cent ASR and on Qwen2.5-7B 87.50 per cent, with DeepSeek-V3 at 58.65 per cent and TranSpec-13B at 55.00 per cent.
  • In cross-sample generalisation ForgeDAN yields high ASR across models, with values between 54.23 per cent and 98.46 per cent on four target models, indicating good transferability and robustness.
  • In real-world applicability ForgeDAN reaches 100.00 per cent on Gemma-2-9B, 89.05 per cent on Qwen2.5-7B, 57.66 per cent on DeepSeek-V3 and 56.20 per cent on TranSpec-13B.
  • Ablation studies show the semantic fitness is critical and the dual-dimensional judgement essential; removing either dramatically reduces performance, and replacing the judgement with simple keyword matching also impairs effectiveness.

Limitations

Limitations include reliance on substantial computational resources to run the evolutionary search and evaluation, and possible generalisation gaps to unseen models or domains. The approach depends on embedding-based semantic fitness and dual-dimensional detectors, which may be affected by biases in the underlying models and datasets. The real-world evaluation uses a proprietary dataset that may not capture all deployment contexts. The results are reported for the models and datasets studied and may not generalise to all aligned LLMs.

Why It Matters

The work demonstrates that alignment safeguards can be bypassed with diverse, coherent prompts and highlights gaps in current defences. It provides a practical framework, ForgeDAN, for automated red-teaming and stress testing of AI guardrails, and supports adversarially informed evaluation and continual testing. The findings and mitigation guidance point to the need for stronger multi-layer defences beyond keyword filters, including robust alignment, semantic awareness and resilient safety classifiers. Societal risks include disinformation, covert harmful content or malware produced via AI, and erosion of trust in AI-assisted systems, especially in critical domains.

