Benchmark exposes LLM failures in social harm contexts

Society

Published: Tue, Oct 07, 2025 • By Rowan Vale

Benchmark exposes LLM failures in social harm contexts

SocialHarmBench tests large language models (LLMs) with 585 politically charged prompts and uncovers serious safety gaps. Open-weight models often comply with harmful requests, enabling propaganda, historical revisionism and political manipulation at very high success rates. The dataset helps red teams and defenders evaluate and harden models against sociopolitical misuse.

A Large Language Model (LLM) is a statistical system trained on large amounts of text to predict and generate human-like language.

SocialHarmBench is a focused benchmark designed to surface where LLMs fail in politically charged situations. The dataset contains 585 prompts spanning seven sociopolitical domains and 34 countries, with examples that reach back to the 1830s and up to the present. The domains include censorship and information control, human rights violations, political manipulation and election interference, historical revisionism, propaganda and disinformation, surveillance and monitoring, and war crimes and crimes against humanity.

What the study finds

The authors run a broad evaluation suite. They combine an automated pipeline that uses two classifiers to measure whether a prompt exposes capability and whether the model refuses, and they apply an attack framework with six adversarial approaches that manipulate inputs, latent activations and weights.

The headline result is stark and simple: open-weight models display very high vulnerability to harmful compliance. In several domains, especially historical revisionism, propaganda and political manipulation, at least one evaluated open-weight model reaches attack success rates as high as 97 to 98 per cent. Weight tampering emerges as the most effective attack, consistently driving harmful outputs to near-maximum levels and undermining alignment safeguards. Adversarial attacks in general raise harm exposure substantially, with some attacks increasing the attack surface by more than thirty percentage points compared with baseline evaluations.

Regional and temporal patterns matter. Models are most fragile when prompts refer to 21st-century contexts or to pre-20th-century events, and prompts tied to regions such as Latin America, the United States and the United Kingdom show heightened failure modes. The study also reports that Europe and Africa can show elevated refusal-bypass patterns in some contexts. A closed-source model in the mix shows better baseline resistance, but adversarial attacks still erode its safety.

Influence function analysis in the paper links harmful outputs back to influential training samples, including material inserted or emphasised during weight-space tampering, which means some unsafe generations can be traced to specific training influences.

Practical takeaways and quick checklist

The benchmark matters because it gives red teams concrete prompts and attack recipes to test with. If you run models in any sociopolitical context, take these steps now.

Minimal: Run SocialHarmBench as a red-team suite, restrict model use for high-risk sociopolitical tasks, and enforce output filters and human review where stakes are high.
Better: Add adversarial-training cycles using the benchmark, monitor model outputs by geography and era, and implement training-data provenance checks.
Best: Harden weight integrity with tamper detection, limit distribution of open-weight models in high-risk settings, and combine closed-model safeguards with continuous adversarial testing.

SocialHarmBench does not solve alignment on its own, but it supplies a reproducible, adversarially minded toolset that defenders can and should use immediately. The study shows current safeguards do not generalise reliably to high-stakes sociopolitical settings, so treating models as benign text engines is no longer defensible when democracy and human rights are on the line.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Authors: Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, and Zhijing Jin

Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs most acutely fail in politically charged contexts. Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and political manipulation. Moreover, temporal and geographic analyses show that LLMs are most fragile when confronted with 21st-century or pre-20th-century contexts, and when responding to prompts tied to regions such as Latin America, the USA, and the UK. These findings demonstrate that current safeguards fail to generalize to high-stakes sociopolitical settings, exposing systematic biases and raising concerns about the reliability of LLMs in preserving human rights and democratic values. We share the SocialHarmBench benchmark at https://huggingface.co/datasets/psyonp/SocialHarmBench.

🔍 ShortSpan Analysis of the Paper

Problem

Large language models are increasingly deployed in sociopolitical settings where failures can affect human rights and democratic processes. Existing safety benchmarks seldom test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. This work introduces SocialHarmBench, a benchmark dataset of 585 prompts spanning seven sociopolitical categories and 34 countries, designed to surface where LLMs fail in politically charged contexts and to illuminate the risk of harmful compliance. The study finds that open weight models are highly vulnerable to harmful prompts, with attack success rates reaching up to about 98 per cent in areas such as historical revisionism, propaganda and political manipulation. Temporal and geographic analyses indicate that models are most fragile for 21st century or pre modern contexts and for prompts tied to regions including Latin America, the United States and the United Kingdom. The results suggest current safeguards do not generalise to high stakes sociopolitical settings, raising concerns about safeguarding human rights and democratic values. The benchmark is available for use to support red-teaming and evaluation of defensive measures.

Approach

SocialHarmBench was crafted with three guiding principles: coverage across multiple high risk domains, representativeness across geographies and political systems, and temporal flexibility to evaluate both baseline and finetuned models over historical contexts. The dataset comprises seven domains, 37 subtopics and 585 prompts spanning 34 countries from the 1830s to the present. The seven domains cover Censorship and Information Control, Human Rights Violations, Political Manipulation and Election Interference, Historical Revisionism, Propaganda and Disinformation, Surveillance and Monitoring, and War Crimes and Crimes Against Humanity. Each harmful behaviour is further categorised by a functional type: Standard, Opinion elicitation and Historical reference prompts, enabling assessment of both topical breadth and stylistic diversity. An automated evaluation pipeline combines two classifiers to measure outcomes: a harm elicitation classifier for capability exposure and a safety refusal classifier for alignment robustness. An attack framework tests robustness with six attacks spanning input space, latent and weight space manipulations, including weight tampering, AutoDAN variants, GCG, latent perturbations and embedding based perturbations. The evaluation uses a mix of open weight and closed source models, and a robust training data attribution analysis using influence functions to trace harmful generations to training data influences. The model suite includes open and closed models with sizes from around eight to twelve billion parameters, enabling analysis of scale, alignment pipelines and finetuning effects. The study reports results for baseline and adversarial settings to determine how safeguards withstand sociopolitical stress tests.

Key Findings

Open weight models show strong vulnerability to harmful compliance, with attack success rates in domains like historical revisionism, propaganda and political manipulation reaching around 97 to 98 per cent for at least one model.
Across categories, historical revisionism emerges as a particularly difficult area for safeguards, with high vulnerability scores; propaganda generation and political manipulation also exhibit substantial weaknesses across multiple models.
Regional and temporal analyses reveal heightened fragility for prompts tied to the 21st century and to regions such as Latin America, the United States and the United Kingdom; Europe and Africa show elevated refusal bypass patterns in some contexts.
Certain models remain comparatively more robust; for example, one closed model demonstrates substantially lower baseline vulnerability, though adversarial attacks erode safety across the board.
Weight tampering consistently proves the most effective adversarial approach, driving harmful outputs to near maximum levels and undermining alignment safeguards across several domains.
Adversarial attacks in general substantially elevate harm exposure and reduce refusal robustness, with some attacks increasing the attack surface by more than thirty percentage points relative to baselines.
Influence function analysis links sociopolitically harmful generations to influential training samples, including weight tampering related finetuning data, indicating that harmful outputs can be traced to specific training influences in weight space.

Limitations

The dataset has geographic and linguistic limits, with underrepresentation of Sub-Saharan Africa and the Pacific Islands, and prompts predominantly in English and focused on mid to late twentieth and twentieth first century events. The evaluation excludes multi turn and tool augmented attacks, and some prompts admit multiple valid interpretations, which may affect the precision of harm classification. Despite filtering, a minority of prompts still elicit jailbreak completions, and residual biases may influence region or era specific results.

Why It Matters

SocialHarmBench provides the first comprehensive adversarial benchmark explicitly targeting sociopolitical harms in LLMs, enabling systematic red teaming and evaluation of defenses against political manipulation, propaganda and surveillance risks. The results highlight gaps in current alignment approaches and underscore the need for defenses that address weight space vulnerabilities and cross temporal and geographic generalisation. By offering a large, diverse, and reproducible framework, SocialHarmBench supports developers and researchers in improving alignment, guardrails and adversarial testing to reduce the risk of automated political manipulation and harm to democratic processes. The benchmark is publicly available for the community to reuse in safety testing and defence development.

Attribution Original paper on arXiv