Automated Red-Teaming Exposes Global AI Disinformation Gaps

Pentesting

Published: Wed, Sep 24, 2025 • By Theo Solander

Automated Red-Teaming Exposes Global AI Disinformation Gaps

A new method called anecdoctoring automates multilingual adversarial prompt generation using nearly 9,815 fact-checked claims in English, Spanish and Hindi from the US and India. By clustering narratives and adding knowledge graphs, it raises attack success rates above 80% for several models and shows where English-centric safety testing leaves dangerous blind spots.

Researchers present anecdoctoring, an automated red-teaming approach that generates culturally relevant adversarial prompts from real misinformation. It matters because generative models deployed globally can be probed and coaxed into producing disinformation in multiple languages, and many current evaluations remain English and US centric.

For security teams the scope is practical and immediate: the authors compile almost 9,815 fact-checked items across English, Spanish and Hindi from the United States and India, then turn those items into narrative clusters and knowledge graphs to feed an attacker Large Language Model (LLM). The outcome is higher attack success rates and clearer traces of how narratives travel through language and place.

How the method works

The pipeline clusters claims using multilingual embeddings, dimensionality reduction and density clustering, then constructs knowledge graphs for each narrative. An attacker LLM uses that structured context to generate adversarial prompts, typically as social posts. A two-stage quality control combines an automated Judge LLM for harm scoring with human review. The authors test GPT-4o, GPT-4o mini and Llama 3 models and integrate orchestration code into PyRIT.

Results are stark. Adversarial prompts grounded in real-world narratives yield attack success rates in excess of 80% for several models. Clustering increases success by about 14 percentage points versus individual claims, and adding knowledge graphs adds a further roughly 9 points. The GPT family shows higher and more consistent susceptibility than the open-source Llama variants.

Impact and risk are clear: culturally tailored disinformation can scale across languages and models, and medium and locale affect success. The authors note a genuine misuse risk since the technique could be repurposed by bad actors.

Defensive takeaways are concrete. Teams should add multilingual, place-aware red-teaming to test suites, apply narrative clustering and knowledge-graph auditing to improve interpretability, and enforce cross-lingual prompt filtering and human linguistic review. Test both closed and open models and be wary of overestimating robustness from English-only checks.

Limitations matter too: repeated interactions may inflate measured success, the approach relies on languages with decent training data coverage, and substantial human review remains necessary. The historical lesson is familiar: safety practices that ignore language and locality produce blind spots. Practitioners should treat multilingual red-teaming as standard, not optional, as models continue to spread worldwide.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Anecdoctoring: Automated Red-Teaming Across Language and Place

Authors: Alejandro Cuevas, Saloni Dash, Bharat Kumar Nayak, Dan Vann, and Madeleine I. G. Daepp

Disinformation is among the top risks of generative artificial intelligence (AI) misuse. Global adoption of generative AI necessitates red-teaming evaluations (i.e., systematic adversarial probing) that are robust across diverse languages and cultures, but red-teaming datasets are commonly US- and English-centric. To address this gap, we propose "anecdoctoring", a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India). We then cluster individual claims into broader narratives and characterize the resulting clusters with knowledge graphs, with which we augment an attacker LLM. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting. Results underscore the need for disinformation mitigations that scale globally and are grounded in real-world adversarial misuse.

🔍 ShortSpan Analysis of the Paper

Problem

Disinformation is a major risk associated with the misuse of generative AI. Red-teaming to probe AI systems across languages and cultures is essential, yet most red-teaming data are US English centric, limiting global safety assessments. The paper introduces anecdoctoring, an automated method to generate adversarial prompts across languages and cultures by leveraging real world misinformation data and knowledge graphs to augment an attacker large language model. The approach aims to produce higher, more interpretable attack success rates than simple prompt baselines and to highlight cross language and cross cultural safety gaps in AI systems.

Approach

The method comprises three steps. First, misinformation claims are gathered from public fact checking sources in English, Spanish and Hindi, across the United States and India, resulting in a dataset of 9 815 articles spanning 2022 to 2024. Second, claims are clustered to surface dominant narratives and to enable coherent, narrative based prompts; embeddings are created with a multilingual model, reduced with UMAP, and clustered with HDBSCAN separately for each language and geography. Third, knowledge graphs are constructed for each narrative cluster using few shot prompting, and an attacker LLM uses these graphs to generate adversarial prompts in the form of tweets. An attacker LLM can also be prompted to produce other content such as news headlines in ablation studies. The system integrates a two stage quality control: automatic harm scoring with a Judge LLM and manual review by linguistically skilled researchers; prompts considered successful produce outputs that violate typical safety policies. The authors test four models (GPT-4o, GPT-4o mini, Llama 3 1 70b, Llama 3 1 8b) across the language location pairs, and compare to baselines using one shot prompts with individual claims and few shot prompts using clusters without KG augmentation. The project builds on PyRIT to provide orchestration code for practitioners.

Key Findings

High attack success rates across languages and models: adversarial prompts grounded in real world misinformation achieve attack success rates in excess of 80 per cent for several models across all language location pairs.
KG augmentation yields interpretability with minimal ASR penalty: clustering alone increases ASR by about 14 percentage points over using individual claims, and incorporating knowledge graphs adds a further roughly 9 percentage points, while overall ASRs remain high.
Model and language differences: the GPT family shows higher ASRs and lower variance than open source Llama models; GPT-4o mini has the highest average ASR across conditions, while Llama 8b records the lowest; the larger Llama 70b outperforms the 8b model but still trails GPT models in overall performance.
Language and medium effects: removing triggers or changing content medium influences ASR in ablation tests, with triggers not being the sole driver of high ASRs and medium (tweet versus news headline) affecting some languages differently; Hindi with certain LLama settings showed notable sensitivity to medium changes.
Interpretability and localisation: knowledge graphs reveal locale specific narrative elements, such as top political figures and contested NORPs that differ between India and the United States, aiding defenders in understanding and policing place specific risks; clusters illustrate how narratives connect entities around topics like vaccines and elections.
Open science and validation: the authors provide their orchestration code as part of PyRIT and perform human validation on a 10 per cent sample, achieving substantial inter rater agreement in the US and moderate agreement in India, supporting the reliability of the automated evaluation pipeline.

Limitations

The study notes that ASRs may be overestimated for repeated model interactions since the prompts have already passed guardrails once, although they implemented manual review to mitigate this. Results with open source LLama models are comparatively lower, and the method may be less effective for low resource languages; the analysed languages are relatively well represented in training data. Human review was required at multiple stages, and expanding to more languages and regions would require additional data collection and ground truth work. The dataset's Spanish component was smaller, and extending across the 70 plus countries with available fact checking would demand broader data support. There is also a caveat that the approach could be misused by adversaries to generate culturally relevant disinformation, underscoring the need for scalable defensive countermeasures.

Why It Matters

The work highlights global disinformation risks and gaps in AI safety that arise when safety evaluations neglect linguistic and cultural diversity. It argues for multilingual, cross cultural red teaming as part of safety evaluations, the integration of knowledge graph based auditing to improve grounding and interpretability, stronger cross lingual defenses and robust prompt filtering, and the development of better non English datasets. The findings indicate that misuses can scale across languages and regions, potentially extending the reach and effectiveness of misinformation campaigns. The authors emphasise the societal importance of globally aware defenses and provide open source tooling to enable defenders to adapt the method to new places and models.

Attribution Original paper on arXiv