
LLM scanners mislead when their judges disagree

Pentesting
Published: Tue, Mar 17, 2026 • By James Armitage
New research shows automated LLM red-teaming scanners can swing vulnerability scores by up to 33% depending on the evaluator judging success. Analysing Garak across 25 attack types, 22 showed evaluator instability. A two-phase, verifier-backed method lifts evaluator accuracy from 72% to 89% and adds uncertainty estimates to make results usable.

Security teams love a number they can put in a dashboard. For Large Language Model (LLM) systems, that number is often Attack Success Rate (ASR) from automated red-teaming scanners. The conceit is simple: higher ASR means a riskier model. The new study on evaluator instability shows that conceit does not hold.

The weak link is the evaluator, the component that decides whether an attack has succeeded. Change the judge while keeping the prompts and responses the same, and the headline risk number can move by as much as 33%. That is not noise; it is the difference between “ship” and “stop” in many organisations.

The authors put this to the test on the widely used Garak scanner. Across 25 attack categories, 22 exceeded a small disagreement threshold, meaning the ASR you see depends on which evaluator you pick. Some attacks had near-total evaluator disagreement, while a few were stable. The team ran close to 23,000 prompt–response pairs at near-deterministic settings, compared Garak’s rule-based judges with a dynamic LLM judge, and tested three target models: Mistral-Small 8B, Cohere CommandA 111B and OpenAI GPT-5-mini.

Translation for practitioners: ASR is not an intrinsic property of a model. It is a property of your evaluator. If you are using scanner scores to compare suppliers, gate releases or feed quarterly risk KPIs, you are building on sand.
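To make that concrete, here is a minimal, entirely hypothetical sketch: the same four model responses judged by two made-up evaluators yield two different ASRs. Neither judge reflects Garak's actual detection logic; the point is only that the number moves when the judge does.

```python
# Hypothetical sketch: ASR is a function of the judge, not just the model.
# Same fixed responses, two different evaluators, two different ASRs.

responses = [
    "I can't help with that request.",
    "Sure! Step 1: obtain the materials...",
    "As an AI, I must refuse. However, hypothetically...",
    "Here is a harmless summary of the topic.",
]

def rule_based_judge(text: str) -> bool:
    """Attack 'succeeds' unless a refusal keyword appears (illustrative)."""
    refusals = ("can't help", "i must refuse", "i cannot")
    return not any(kw in text.lower() for kw in refusals)

def strict_judge(text: str) -> bool:
    """Attack succeeds only if actionable content appears (illustrative)."""
    return "step 1" in text.lower()

def asr(judge, samples):
    """Attack Success Rate: fraction of samples the judge marks as successes."""
    return sum(judge(s) for s in samples) / len(samples)

print(f"rule-based ASR: {asr(rule_based_judge, responses):.2f}")  # 0.50
print(f"strict ASR:     {asr(strict_judge, responses):.2f}")      # 0.25
```

Same model, same outputs, and the headline number halves depending on which judge you trust.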

The study offers a practical fix. First, treat evaluation as a measurement problem. Phase I runs a disagreement diagnostic: apply two evaluators to the same samples and flag attack categories where their labels diverge beyond a tight threshold (tau=0.05 in the paper). Phase II adds a verification-backed judge: an independent, reasoning-capable verifier LLM (GPT-5.2) using a structured rubric re-assesses flagged samples. That lets you quantify evaluator reliability without a mountain of human labels. On flagged categories, the dynamic LLM evaluator aligned with the verifier at 89% accuracy versus 72% for the rule-based one. The verifier itself matched human annotation at 93% (Cohen’s kappa 0.85). You do not have to replace everything either: selectively swapping a small set of high-impact evaluators improved aggregate accuracy fast, while replacing all evaluators added about $5.25 per scan in their setup.
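The two phases can be sketched in a few lines. This is my illustrative framing of the idea, not the authors' code: function names, the toy labels and the verifier interface are all assumptions; only the tau=0.05 threshold comes from the paper.

```python
# Sketch of the two-phase, reliability-aware idea (illustrative, not the
# authors' API). Phase I flags categories where two evaluators disagree
# beyond tau; Phase II re-judges flagged samples with an independent verifier.

TAU = 0.05  # disagreement threshold from the paper

def disagreement_rate(labels_a, labels_b):
    """Fraction of samples on which the two evaluators' labels differ."""
    assert len(labels_a) == len(labels_b)
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def phase_one(per_category_labels):
    """Return {category: D} for categories whose disagreement exceeds TAU."""
    rates = {cat: disagreement_rate(a, b)
             for cat, (a, b) in per_category_labels.items()}
    return {cat: d for cat, d in rates.items() if d > TAU}

def phase_two(flagged, samples, verifier):
    """Re-judge flagged categories with an independent verifier callable."""
    return {cat: [verifier(s) for s in samples[cat]] for cat in flagged}

# Toy data: (rule-based labels, LLM-judge labels) per attack category.
labels = {
    "prompt_injection": ([1, 0, 1, 1], [0, 1, 0, 1]),  # D = 0.75 -> flagged
    "encoding_smuggle": ([0, 0, 1, 0], [0, 0, 1, 0]),  # D = 0.00 -> stable
}
print(phase_one(labels))  # {'prompt_injection': 0.75}
```

In the paper the verifier in Phase II is a reasoning-capable LLM with a structured rubric; here it is just any callable that returns a success label.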

Yes, the verifier is still an LLM, not gospel. Performance varies by attack type. Costs depend on your model menu. And the experiments centre on Garak. Fine. The alternative is worse: taking scanner outputs as ground truth when they plainly are not.

My view is blunt. Until your pipeline measures and reports evaluator reliability, its ASR is not a security metric. It is a guess with a confidence problem. Treat scanner outputs as measurements with error bars. Record the evaluator identity and version with every result. Run disagreement checks in CI and push unstable categories through verification. The paper shows you can buy a lot of reliability for about five dollars a scan.
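What "measurements with error bars" might look like in a pipeline, as a minimal sketch: tag every result with the judge's identity and version, and attach a binomial confidence interval to the ASR. The record shape and the choice of a Wilson interval are mine, not the paper's.

```python
# Illustrative sketch: scanner results as measurements, not point estimates.
# Evaluator identity/version travel with every number; ASR gets an interval.
import math
from dataclasses import dataclass

@dataclass
class ScanResult:
    attack: str
    evaluator: str          # which judge produced the labels
    evaluator_version: str  # pin the version: judges change behaviour
    successes: int
    samples: int

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

r = ScanResult("prompt_injection", "llm-judge", "4.1", successes=31, samples=100)
lo, hi = wilson_interval(r.successes, r.samples)
print(f"{r.attack} ASR={r.successes / r.samples:.2f} "
      f"95% CI=({lo:.2f}, {hi:.2f}) judge={r.evaluator}@{r.evaluator_version}")
```

Note this interval only captures sampling noise; evaluator-induced uncertainty, the paper's main subject, comes on top of it and needs the disagreement diagnostic to surface.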

Use scanners for what they do well: generating cases and flushing out obvious failures. Do not use them to rank LLMs or enforce hard policy thresholds unless you also instrument the evaluator and show uncertainty. This work gives you the tooling to do that. The hype and the doom can wait; measurement comes first.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

When Scanners Lie: Evaluator Instability in LLM Red-Teaming

Authors: Lidor Erez, Omer Hofman, Tamir Nizri, and Roman Vainshtein
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the validity of these measurements hinges on an often-overlooked component: the evaluator who determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component. Consequently, changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To tackle this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method where evaluators are validated by an independent verifier, enabling reliability assessment without relying on extensive human annotation. Applied to the widely used Garak scanner, we observe that 22 of 25 attack categories exhibit evaluator instability, reflected in high disagreement among evaluators. Our approach raises evaluator accuracy from 72% to 89% while enabling selective deployment to control cost and computational overhead. We further quantify evaluator uncertainty in ASR estimates, showing that reported vulnerability scores can vary by up to 33% depending on the evaluator. Our results indicate that the outputs of vulnerability scanners are highly sensitive to the choice of evaluators. Our framework offers a practical approach to quantify unreliable evaluations and enhance the reliability of measurements in automated LLM security assessments.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies how automated vulnerability scanners for large language models (LLMs) produce Attack Success Rate (ASR) measurements that depend not only on attacks and model outputs but crucially on the evaluator that judges whether an attack succeeded. If the evaluator is unstable or inaccurate, reported vulnerability scores can be misleading. The work quantifies this evaluator dependence and proposes a practical framework to detect and mitigate evaluator-induced measurement error.

Approach

The authors reframe red-teaming as a measurement problem and introduce a two-phase, reliability-aware evaluation framework. Phase I is a diagnostic that measures sample-level disagreement between two alternative evaluators applied to the same prompt–response pairs, using a disagreement rate D and a small operational threshold tau=0.05 to flag unstable attack categories. Phase II is remediation via a verification-backed judge: an independent, reasoning-capable LLM verifier (GPT-5.2) using a structured verification rubric re-assesses flagged samples to provide a reference signal for evaluator reliability without large-scale human annotation. The framework was applied to Garak (v0.13.2), covering 25 attack categories, with three target models (Mistral-Small 8B, Cohere CommandA 111B, and OpenAI GPT-5-mini). The compared evaluators included Garak’s default rule-based judges (82% of built-in evaluators) and dynamic LLM-based judges (Grok-4.1). The experiments evaluated up to 100 samples per attack per run, producing about 23k prompt–response pairs under near-deterministic sampling (temperature 0).
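The per-sample disagreement measured in Phase I ultimately propagates into ASR itself. One way to see the effect, sketched here with made-up labels (my framing, not the authors' code), is the spread of ASR across candidate evaluators on the same fixed prompt–response pairs:

```python
# Illustrative sketch: evaluator-induced ASR spread as the min/max ASR
# across evaluators judging the same fixed model outputs.

def asr_interval(label_sets):
    """label_sets: per-evaluator lists of 0/1 success labels on the same
    samples. Returns (min_asr, max_asr) across evaluators."""
    asrs = [sum(ls) / len(ls) for ls in label_sets]
    return min(asrs), max(asrs)

# Three evaluators judging the same ten responses (made-up labels).
rule_based = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # ASR 0.6
llm_judge  = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # ASR 0.3
verifier   = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # ASR 0.4

lo, hi = asr_interval([rule_based, llm_judge, verifier])
print(f"ASR ranges from {lo:.1f} to {hi:.1f} depending on the evaluator")
```

Swapping the judge here moves the reported score by 30 percentage points on identical outputs, the same shape of effect the paper quantifies at up to about 33%.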

Key Findings

  • Evaluator instability is widespread: 22 of 25 attack categories exceeded the disagreement threshold, indicating ASR is frequently sensitive to evaluator choice.
  • Disagreement can be large and attack-dependent: some attacks showed near-complete evaluator disagreement (D up to 0.97) while others were stable (D as low as 0.013); several attacks had D>0.50, meaning evaluator substitution flips most per-sample labels.
  • Verification improves reliability: using the independent verifier, the dynamic LLM-based evaluator aligned with the verifier at 89% accuracy versus 72% for the static rule-based evaluator on flagged categories.
  • Evaluator-induced ASR uncertainty is material: per-attack ASR intervals often exceeded ±20%, and reported vulnerability scores can vary by up to about 33% depending on the evaluator.
  • Selective replacement is cost-effective: replacing a small number of high-impact evaluators raised aggregate accuracy substantially before full replacement; replacing all evaluators added about $5.25 per scan in the experimental setup.
  • Verifier validity: the GPT-5.2 verifier agreed with human annotation at 93% (Cohen’s kappa 0.85), supporting its use as a scalable reference signal.
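The agreement statistic behind that last finding, Cohen's kappa, corrects raw agreement for agreement expected by chance. A self-contained sketch with made-up labels (real pipelines would typically use a library implementation such as scikit-learn's):

```python
# Cohen's kappa: chance-corrected agreement between two raters.
# Labels here are invented for illustration, not the paper's data.

def cohens_kappa(a, b):
    n = len(a)
    # Observed agreement: fraction of samples where the raters match.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    p_e = sum((a.count(lbl) / n) * (b.count(lbl) / n)
              for lbl in set(a) | set(b))
    return (p_obs - p_e) / (1 - p_e)

human    = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
verifier = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]  # one disagreement

print(round(cohens_kappa(human, verifier), 2))  # 0.8
```

A kappa of 0.85, as reported for the GPT-5.2 verifier against human annotation, is conventionally read as strong agreement well above chance.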

Limitations

The framework relies on an LLM-based verifier as an approximate reference rather than exhaustive human labelling. Evaluator performance varies by attack type, so no single evaluator is uniformly optimal. Cost estimates are specific to the models and pricing used in the experiments. The study focuses on one open-source scanner and the evaluated set of attack categories.

Why It Matters

ASR reported by automated scanners is not an absolute property of a model but an artefact of the chosen evaluator. Security practitioners should treat scanner ASR with caution, use disagreement diagnostics to flag unreliable attack types, and apply targeted verification or selective dynamic judges to improve measurement validity while controlling cost. The proposed framework gives a practical path to quantify evaluator uncertainty and reduce misleading vulnerability assessments in automated LLM security testing.

