
New TeleAI-Safety Benchmark Exposes LLM Jailbreak Risks

Pentesting
Published: Mon, Dec 08, 2025 • By James Armitage
TeleAI-Safety gives security teams a modular, reproducible way to probe large language model (LLM) jailbreaks. The benchmark tests 342 attack samples across 14 models with 19 attack techniques, 29 defence methods and 19 evaluation approaches. Results show model-specific weak points, evaluator variability and the need for multi-layer, ongoing testing in live deployments.

A new benchmark called TeleAI-Safety aims to bring some structure to a messy problem: how to test large language models against jailbreaks and prompt-based exploits. The project bundles a curated corpus of 342 attack prompts across 12 risk categories, 19 attack methods including a self-developed technique named Morpheus, 29 defence approaches and 19 evaluation methods including a multi-agent evaluator called RADAR. It runs these combinations across 14 target models, from open-source variants to commercial APIs.

What the benchmark shows

The headline finding is straightforward. No single model or single defence stops every attack. White-box models tend to show higher attack success rates (ASR) than black-box API models, and some proprietary APIs performed relatively well on these tests. The paper reports example attack success rates such as OpenAI o1 at 0.21, GPT 5 at 0.28 and Claude 3 at 0.11, while Grok 3 had higher success rates. Certain defences, RA-LLM and ICD in this evaluation, show broad robustness across multiple attack types. Others, such as paraphrasing layers and Layer Adv Patcher, offer only limited protection and can be bypassed by targeted attacks.
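
For reference, attack success rate in benchmarks like this is typically the fraction of attack prompts whose responses an evaluator judges unsafe. A minimal sketch in Python, with a hypothetical judge callable standing in for one of the benchmark's 19 evaluation methods:

def attack_success_rate(responses, is_unsafe):
    # Fraction of responses the judge flags as unsafe; 0.0 for an empty batch.
    judged = [is_unsafe(r) for r in responses]
    return sum(judged) / len(judged) if judged else 0.0

# Toy usage: 21 of 100 responses flagged as unsafe gives an ASR of 0.21.
toy_responses = ["unsafe output"] * 21 + ["refusal"] * 79
print(attack_success_rate(toy_responses, lambda r: r == "unsafe output"))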

Another practical point is evaluator inconsistency. Different evaluation methods can produce markedly different attack success rates for the same prompt. The authors flag PAIR as particularly inconsistent and propose RADAR, a debate-style multi-agent method, to reduce bias. That matters for practitioners: without standardised evaluation you cannot reliably compare fixes or measure improvement.
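
The article does not detail RADAR's protocol beyond calling it a debate-style multi-agent method, but the underlying idea, several independent judges score the same output and disagreement is surfaced rather than hidden, can be sketched as a simple majority vote. This is an illustrative assumption, not RADAR itself:

from collections import Counter

def consensus_verdict(response, judges):
    # Each judge maps a model response to "safe" or "unsafe".
    votes = [judge(response) for judge in judges]
    verdict, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)  # low agreement is itself a signal of evaluator variance
    return verdict, agreement

# Toy usage with three hypothetical judges that disagree.
judges = [lambda r: "unsafe", lambda r: "safe", lambda r: "unsafe"]
print(consensus_verdict("model output under test", judges))  # ('unsafe', 0.66...)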

Balancing risk, cost and realism

There are two common overreactions in this space. One side treats any jailbreak research as proof that LLMs are unusable in production. The other side assumes off-the-shelf models are inherently safe if vendors say so. TeleAI-Safety sits between those extremes. It does not claim to eliminate all risk, but it does provide a repeatable, extensible way to quantify it. The framework is computationally heavy and focused on text-only attacks, so it is not a turnkey solution for every environment, but it is a useful baseline for security teams that need evidence rather than anecdotes.

My view is pragmatic. Use TeleAI-Safety or a comparable modular approach as part of an ongoing risk programme. Treat results as directional, not absolute. Expect model-specific blind spots and evaluator variance. Put effort into multi-layer defence, operational controls and continuous red teaming rather than one-off hardening.

Practical steps for security teams include: run benchmarked tests that reflect your actual prompts and threat model; combine internal model hardening with external monitoring and rate limits; standardise evaluation so you can compare changes over time; and prioritise human review or gating for high-risk outputs. TeleAI-Safety does not end the conversation, but it helps turn debates about safety into measurable trade-offs. That utility matters more than alarmism or complacency.
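
One of those steps, gating high-risk outputs for human review, is straightforward to prototype. The sketch below uses a hypothetical risk_score callable; in practice that role could be filled by a moderation classifier or one of the evaluators exercised in a benchmark such as TeleAI-Safety.

def gate_output(response, risk_score, threshold=0.7):
    # Hold responses whose estimated risk exceeds the threshold for human review.
    score = risk_score(response)
    if score >= threshold:
        return {"status": "held_for_review", "score": score}
    return {"status": "released", "score": score, "response": response}

# Toy usage with a keyword-based scorer standing in for a real classifier.
toy_scorer = lambda text: 0.9 if "bypass the safety filter" in text.lower() else 0.1
print(gate_output("Here is a harmless summary of the report.", toy_scorer))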

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

Authors: Xiuyuan Chen, Jian Zhao, Yuxiang He, Yuan Xun, Xinwei Liu, Yanshu Li, Huilin Zhou, Wei Cai, Ziyan Shi, Yuchen Yuan, Tianle Zhang, Chi Zhang, and Xuelong Li
While the deployment of large language models (LLMs) in high-value industries continues to expand, the systematic assessment of their safety against jailbreak and prompt-based attacks remains insufficient. Existing safety evaluation benchmarks and frameworks are often limited by an imbalanced integration of core components (attack, defense, and evaluation methods) and an isolation between flexible evaluation frameworks and standardized benchmarking capabilities. These limitations hinder reliable cross-study comparisons and create unnecessary overhead for comprehensive risk assessment. To address these gaps, we present TeleAI-Safety, a modular and reproducible framework coupled with a systematic benchmark for rigorous LLM safety evaluation. Our framework integrates a broad collection of 19 attack methods (including one self-developed method), 29 defense methods, and 19 evaluation methods (including one self-developed method). With a curated attack corpus of 342 samples spanning 12 distinct risk categories, the TeleAI-Safety benchmark conducts extensive evaluations across 14 target models. The results reveal systematic vulnerabilities and model-specific failure cases, highlighting critical trade-offs between safety and utility, and identifying potential defense patterns for future optimization. In practical scenarios, TeleAI-Safety can be flexibly adjusted with customized attack, defense, and evaluation combinations to meet specific demands. We release our complete code and evaluation results to facilitate reproducible research and establish unified safety baselines.

🔍 ShortSpan Analysis of the Paper

Problem

Large language models are increasingly deployed in high-value settings, but systematic safety evaluation against jailbreak and prompt-based attacks remains scarce. Existing benchmarks and frameworks are often imbalanced across attack, defence and evaluation methods, and detached from standardised benchmarking, hindering cross-study comparisons and adding overhead to risk assessment. TeleAI-Safety offers a modular, reproducible framework with a standardised benchmark for rigorous LLM safety evaluation. It integrates 19 attack methods (one self-developed), 29 defence methods and 19 evaluation methods (one self-developed). With a curated attack corpus of 342 samples spanning 12 risk categories and evaluations across 14 target models, the framework reveals systematic vulnerabilities and model-specific failure cases, highlights safety-utility trade-offs, and identifies defence patterns for optimisation. The framework can be customised with different attack, defence and evaluation combinations. Code and evaluation results are released to support reproducible research and unified safety baselines.

Approach

TeleAI-Safety is a modular open infrastructure that couples a framework with a standardised benchmark for systematic LLM safety assessment. The framework contains a dataset of 342 attack samples across 12 risk categories, 19 attack methods including a self-developed method named Morpheus, 29 defence methods, and 19 evaluation methods including a self-developed method named RADAR. It evaluates 14 target models comprising nine black-box models accessed by API and five open-source white-box models. Attacks, defences and evaluations share a co-designed architecture with a common configuration and data flow, and the design emphasises extensibility through YAML-based configuration for adding new attacks, defences and evaluations.

The Datasets module supports loading the self-constructed dataset as well as external data sources from the Hugging Face Hub or local files. The Models module provides a unified loading function, load_model, that accepts a model type (local or openai), a model name and a path, enabling seamless switching between local and API-based models. The Attacks module implements the 19 attack methods, including Morpheus, spanning white-box, grey-box and black-box access levels. The Defences module encompasses 29 methods, split into external and internal categories, and supports combining multiple defences. The Evaluations module offers 19 evaluation methods, including RADAR, ranging from rapid pattern-based screening to multi-agent consensus-based assessment.

The TeleAI-Safety dataset uses a 12-category taxonomy; its 342 samples were selected through a principled filtering process that takes baseline refusal and authenticity into account. By testing across 14 models and 342 samples, TeleAI-Safety enables systematic cross-model safety evaluation under various attack and defence configurations.
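
The article does not reproduce the framework's configuration schema, so the following is only a minimal sketch of how such a modular attack/defence/evaluation pipeline could be wired together. The configuration layout, model name, stub bodies and run_benchmark helper are illustrative assumptions; only the load_model(model_type, model_name, path) signature and the attack/defence/evaluation split come from the paper's description.

# Minimal sketch of a modular jailbreak-benchmark loop in the spirit of
# TeleAI-Safety. In the real framework this configuration would live in a
# YAML file; the field names here are assumptions for illustration.
CONFIG = {
    "target": {"model_type": "openai", "model_name": "example-api-model", "path": None},
    "attack": {"method": "Morpheus"},           # one of the 19 attack methods
    "defence": {"methods": ["RA-LLM", "ICD"]},  # defences can be stacked
    "evaluation": {"method": "RADAR"},          # multi-agent consensus evaluator
}

def load_model(model_type, model_name, path=None):
    # Stand-in for the framework's unified loader: return a callable mapping
    # a prompt to a model response (a local checkpoint or an API client).
    return lambda prompt: f"[{model_name} response to: {prompt[:40]}]"

def run_benchmark(cfg, prompts):
    target = load_model(**cfg["target"])
    results = []
    for prompt in prompts:
        adv_prompt = prompt            # the configured attack would rewrite this
        response = target(adv_prompt)  # configured defences would wrap this call
        results.append({"prompt": prompt, "response": response})
    return results                     # the configured evaluator would score these

if __name__ == "__main__":
    print(run_benchmark(CONFIG, ["curated attack sample"]))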

Key Findings

  • The results reveal systematic vulnerabilities and model-specific failure cases, with clear safety-utility trade-offs and indications of defence patterns that can guide optimisation and future work.
  • White-box models generally exhibit higher attack success rates than black-box models. Some API models show relatively low ASR, for example OpenAI o1 at 0.21, GPT 5 at 0.28 and Claude 3 at 0.11, indicating stronger adversarial robustness, while Grok 3 exhibits higher ASR.
  • Defences such as RA-LLM and ICD demonstrate robust performance against multiple attack variants, while paraphrasing and Layer Adv Patcher show limited effectiveness, and some targeted attacks bypass specific defences, illustrating a generalisation crisis in which a single-layer defence is insufficient.
  • Evaluator variability is notable; different evaluators can yield markedly different ASR results for the same attack, with PAIR showing substantial inconsistency, underscoring the need for standardised evaluation frameworks. The RADAR method offers a multi-agent, debate-based approach to reduce evaluator bias.
  • Across Vicuna 7B experiments, defence and attack combinations reduce ASR in many cases, and safety robustness varies by risk category. Proprietary models such as Claude 3.5 show stronger safety performance overall, while open-source models display more uniform results and high robustness in categories such as harm to minors and sexual content.

Limitations

Limitations include a focus on text-based jailbreaks with limited coverage of multimodal or multilingual vulnerabilities. Evaluator discrepancies reveal the need for standardised scoring schemes, and the benchmarking process incurs substantial computational demands, which may constrain use in restricted environments. The framework reflects current attack and defence landscapes, and outcomes may evolve as techniques change.

Why It Matters

TeleAI-Safety provides a unified, extensible framework and benchmark that supports reproducible research and informed governance. By exposing jailbreak techniques and evaluating defensive strategies, it helps identify practical safety trade-offs and informs deployment decisions to mitigate abuse such as disinformation, privacy leakage and fraud. The work underlines the necessity of multi-layer defence and ongoing adaptation to evolving adversaries, and supports community-based baselines for safer real-world LLM deployments. Future directions include extending to multimodal and cross-lingual evaluation, automated red teaming for continual robustness assessment, and adaptive defence co-training to harden models against emerging threats.

