New to ShortSpan? We distil the AI-security research that matters into practitioner takeaways — edited by Ben Williams (NCC Group). Get the weekly email

// Analysis

Stress tests show LLM ethics degrade under pressure

Published: Thu, Apr 02, 2026 • By Marcus Halden

Society

Stress tests show LLM ethics degrade under pressure

A new framework, Adversarial Moral Stress Testing (AMST), probes Large Language Models (LLMs) through multi-turn, adversarial conversations. It tracks variance, tail risk and behavioural drift, not just averages. Tests on LLaMA-3-8B, GPT-4o and DeepSeek-v3 reveal progressive degradation, model differences and order effects, with coercive framing raising violation risk.

Most safety checks for Large Language Models (LLMs) are still single-turn: one prompt, one score. Real use is not like that. People negotiate with systems, press for exceptions, or reframe a request until something gives. A new study argues we should test models the same way, and it offers a practical method to do so.

The authors present Adversarial Moral Stress Testing (AMST), a black-box framework that evaluates ethical robustness during multi-turn, adversarial exchanges. Think of it like a bank stress test, but for a model’s judgement under cumulative pressure. Instead of celebrating a good average score, AMST watches how behaviour shifts over time and how bad the tails get when a conversation turns messy.

What the researchers did

AMST builds adversarial conversations by taking a base prompt and layering in structured stressors such as urgency, deception, norm uncertainty, conflicts of interest and reasoning manipulation. An auxiliary model generates these injections under deterministic decoding, so scenarios are reproducible. This is not free-form chaos; the stressors are constrained by templates and lexical indicators, which makes the setup scalable and comparable across models.

Responses are scored with a multidimensional ethical-risk vector. It includes lexical toxicity, a semantic signal for unsafe guidance based on templates, refusal patterns, and a proxy for explicit justification called the reasoning depth proxy. These signals roll up into severity and robustness indices that are tracked per round. Crucially, the metrics are distribution-aware: they capture mean, variance and tail risk, and they measure drift across turns rather than freezing the analysis at round one.

The team tested several state-of-the-art systems, including LLaMA-3-8B, GPT-4o and DeepSeek-v3, in a deterministic greedy setting on a curated set of ethically sensitive prompts with staged stress sequences.

What they found

All models degrade under sustained adversarial pressure, and the decline is not always smooth. The study reports nonlinear behaviour with cliff-like drops at higher stress levels. Models differ: DeepSeek-v3 is most susceptible to cumulative degradation and amplified drift; LLaMA-3-8B resists best in these tests; GPT-4o lands in between, holding steady longer before losing ground. Perhaps the most important point for practitioners is that distributional stability matters more than mean performance. Two models can share the same average score yet diverge sharply in their worst-case tails.

Another notable correlation is that answers with more explicit justification tend to be more stable on average and show less dispersion. The authors also identify threshold and order effects: transition points around 0.4 and 0.7 initial robustness, and non-commutative stressor order that changes outcomes. Imperative or coercive framings measurably increase the chance of violations, and sensitivity to this tactic varies by model.

For security teams, this is a useful reframing. Red teaming should probe sequences, not just single prompts. Monitoring needs to track drift across a session and watch the tails, not only the mean. The order sensitivity is a practical alert: an attacker who cannot breach a guardrail with one shot may succeed by staging the conversation to induce gradual instability.

There are limits. AMST relies on operational proxies rather than ground-truth moral judgements, focuses on English and text-only interactions, and uses deterministic runs that may miss stochastic quirks or multimodal edge cases. Its detectors and the reasoning proxy are heuristic and may reflect cultural biases. Even so, the methodology surfaces risks that static benchmarks miss.

The broader implication is straightforward: safety evaluations that ignore variance, tails and temporal drift are leaving blind spots. AMST is not the final word, but it shows how to make robustness visible in the places systems are most likely to break: under pressure, over time, and in the long tail.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Adversarial Moral Stress Testing of Large Language Models

Authors: Saeid Jamshidi, Foutse Khomh, Arghavan Moradi Dakhel, Amin Nikanjam, Mohammad Hamdaqa, and Kawser Wazed Nafi

Evaluating the ethical robustness of large language models (LLMs) deployed in software systems remains challenging, particularly under sustained adversarial user interaction. Existing safety benchmarks typically rely on single-round evaluations and aggregate metrics, such as toxicity scores and refusal rates, which offer limited visibility into behavioral instability that may arise during realistic multi-turn interactions. As a result, rare but high-impact ethical failures and progressive degradation effects may remain undetected prior to deployment. This paper introduces Adversarial Moral Stress Testing (AMST), a stress-based evaluation framework for assessing ethical robustness under adversarial multi-round interactions. AMST applies structured stress transformations to prompts and evaluates model behavior through distribution-aware robustness metrics that capture variance, tail risk, and temporal behavioral drift across interaction rounds. We evaluate AMST on several state-of-the-art LLMs, including LLaMA-3-8B, GPT-4o, and DeepSeek-v3, using a large set of adversarial scenarios generated under controlled stress conditions. The results demonstrate substantial differences in robustness profiles across models and expose degradation patterns that are not observable under conventional single-round evaluation protocols. In particular, robustness has been shown to depend on distributional stability and tail behavior rather than on average performance alone. Additionally, AMST provides a scalable and model-agnostic stress-testing methodology that enables robustness-aware evaluation and monitoring of LLM-enabled software systems operating in adversarial environments.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies how ethical behaviour of large language models (LLMs) degrades under sustained adversarial multi‑turn interaction and why single‑shot safety benchmarks miss important failure modes. It argues that rare but high‑impact ethical failures, progressive drift and instability emerge when adversaries apply cumulative pressure through conversational continuation, and that evaluations must capture variance, tail risk and temporal drift rather than mean scores alone.

Approach

The authors introduce Adversarial Moral Stress Testing (AMST), a black‑box, model‑agnostic stress‑testing framework. AMST generates adversarial interaction traces by applying a compositional transformation operator that injects structured stress factors into base prompts, including urgency, deception, norm uncertainty, conflict of interest and reasoning manipulation. Stressor injections are constrained by templates and lexical indicators and produced with an auxiliary LLM under deterministic decoding to ensure reproducibility. Model responses are evaluated with a multidimensional ethical‑risk vector combining lexical toxicity, semantic ethical risk (template‑based detection of unsafe guidance), refusal patterns and a heuristic proxy for explicit justification called the reasoning depth proxy. These components are aggregated into severity and robustness indices and tracked across multiple conversational rounds to compute per‑round drift and distribution‑aware robustness metrics that incorporate mean, variance and tail behaviour. Experiments run in a deterministic greedy setting on LLaMA‑3‑8B, GPT‑4o and DeepSeek‑v3 using a curated corpus of ethically sensitive prompts and staged multi‑round stress sequences.

Key Findings

Progressive degradation: All models show robustness loss as adversarial stress intensifies, with nonlinearity and cliff‑like drops at higher stress levels.
Model differences: DeepSeek‑v3 is most susceptible to cumulative ethical degradation and drift amplification; LLaMA‑3‑8B is the most resistant; GPT‑4o is intermediate, preserving stability across a broader stress range before degrading.
Distributional risks dominate: Robustness depends more on distributional stability, variance and tail risk than on average performance; models with similar means can differ substantially in tail severity.
Reasoning depth helps: Responses exhibiting higher explicit justification correlate with higher mean robustness and reduced dispersion, indicating a stabilising effect of deeper reasoning‑like behaviour.
Threshold and order effects: Robustness exhibits threshold behaviour with empirical transition points near 0.4 and 0.7 initial robustness; the order and composition of stressors matter because the transformation operator is non‑commutative.
Coercive framing increases risk: Imperative or coercive instructions measurably raise violation likelihood, and sensitivity varies by model.

Limitations

AMST uses operational proxies rather than ground‑truth moral judgements, focuses on English prompts and text‑only interactions, and cannot exhaust all adversarial strategies. Evaluations are black‑box and deterministic to reduce variability but may not capture stochastic behaviours or multimodal deployments. Automated rule‑based detectors and the reasoning proxy are heuristic and subject to cultural and normative biases.

Why It Matters

AMST reveals emergent vulnerabilities that static benchmarks miss, such as accumulating ethical drift, tail failures and order‑sensitive collapse. For security and deployment, this supports robustness‑aware evaluation, monitoring and red‑teaming practices that prioritise distributional stability and multi‑round testing, and helps identify models and interaction patterns that pose higher risk in adversarial environments.

Links Original paper on arXiv

Stress tests show LLM ethics degrade under pressure

What the researchers did

What they found

📋 Original Paper Title and Abstract

Adversarial Moral Stress Testing of Large Language Models

🔍 ShortSpan Analysis of the Paper

Problem

Approach

Key Findings

Limitations

Why It Matters

Related Articles

LLM Agents Cheat When Survival Is Threatened

Frontier tests reveal risky LLM agent behaviour

AdversaBench scales LLM red-teaming with multi-judge checks

Related Research

Get the weekly digest