LLM scanners mislead when their judges disagree
Pentesting
Security teams love a number they can put in a dashboard. For Large Language Model (LLM) systems, that number is often the Attack Success Rate (ASR) from automated red-teaming scanners. The conceit is simple: higher ASR means a riskier model. A new study on evaluator instability shows that the conceit does not hold.
The weak link is the evaluator, the component that decides whether an attack has succeeded. Change the judge while keeping the prompts and responses the same, and the headline risk number can move by as much as 33%. That is not noise; it is the difference between “ship” and “stop” in many organisations.
The authors put this to the test on the widely used Garak scanner. Across 25 attack categories, 22 exceeded a small disagreement threshold, meaning the ASR you see depends on which evaluator you pick. Some attacks had near-total evaluator disagreement, while a few were stable. The team ran close to 23,000 prompt–response pairs at near-deterministic settings, compared Garak’s rule-based judges with a dynamic LLM judge, and tested three target models: Mistral-Small 8B, Cohere CommandA 111B and OpenAI GPT-5-mini.
Translation for practitioners: ASR is not an intrinsic property of a model. It is a property of your evaluator. If you are using scanner scores to compare suppliers, gate releases or feed quarterly risk KPIs, you are building on sand.
The study offers a practical fix. First, treat evaluation as a measurement problem. Phase I runs a disagreement diagnostic: apply two evaluators to the same samples and flag attack categories where their labels diverge beyond a tight threshold (tau=0.05 in the paper). Phase II adds a verification-backed judge: an independent, reasoning-capable verifier LLM (GPT-5.2) using a structured rubric re-assesses flagged samples. That lets you quantify evaluator reliability without a mountain of human labels. On flagged categories, the dynamic LLM evaluator aligned with the verifier at 89% accuracy versus 72% for the rule-based one. The verifier itself matched human annotation at 93% (Cohen’s kappa 0.85). You do not have to replace everything either: selectively swapping a small set of high-impact evaluators improved aggregate accuracy fast, while replacing all evaluators added about $5.25 per scan in their setup.
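The Phase I diagnostic is simple enough to sketch. The following is a minimal illustration, not the paper's implementation: the evaluator label lists and category names are made up, while the disagreement rate and the tau = 0.05 threshold come from the paper.

```python
# Phase I sketch: flag attack categories where two evaluators disagree
# on the same prompt-response pairs beyond a tight threshold.
# Labels and category names below are illustrative only.

TAU = 0.05  # operational disagreement threshold used in the paper


def disagreement_rate(labels_a, labels_b):
    """Fraction of samples where the two evaluators assign different labels."""
    assert len(labels_a) == len(labels_b)
    flips = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return flips / len(labels_a)


def flag_unstable_categories(results):
    """results: {category: (rule_based_labels, llm_judge_labels)}.
    Returns categories whose ASR should not be trusted without verification."""
    flagged = {}
    for category, (rule_labels, llm_labels) in results.items():
        d = disagreement_rate(rule_labels, llm_labels)
        if d > TAU:
            flagged[category] = d  # Phase II: route these samples to the verifier
    return flagged


# Toy example: one unstable category, one stable one.
results = {
    "prompt_injection": ([1, 0, 1, 1, 0], [0, 1, 0, 1, 1]),   # heavy disagreement
    "encoding_smuggling": ([0, 0, 1, 0, 0], [0, 0, 1, 0, 0]),  # identical labels
}
print(flag_unstable_categories(results))  # only prompt_injection is flagged
```

Only the flagged categories need the more expensive verifier pass, which is what keeps the cost of Phase II bounded.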
Yes, the verifier is still an LLM, not gospel. Performance varies by attack type. Costs depend on your model menu. And the experiments centre on Garak. Fine. The alternative is worse: taking scanner outputs as ground truth when they plainly are not.
My view is blunt. Until your pipeline measures and reports evaluator reliability, its ASR is not a security metric. It is a guess with a confidence problem. Treat scanner outputs as measurements with error bars. Record the evaluator identity and version with every result. Run disagreement checks in CI and push unstable categories through verification. The paper shows you can buy a lot of reliability for about five dollars a scan.
Use scanners for what they do well: generating cases and flushing out obvious failures. Do not use them to rank LLMs or enforce hard policy thresholds unless you also instrument the evaluator and show uncertainty. This work gives you the tooling to do that. The hype and the doom can wait; measurement comes first.
Additional analysis of the original arXiv paper
📋 Original Paper Title
When Scanners Lie: Evaluator Instability in LLM Red-Teaming
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies how automated vulnerability scanners for large language models (LLMs) produce Attack Success Rate (ASR) measurements that depend not only on attacks and model outputs but crucially on the evaluator that judges whether an attack succeeded. If the evaluator is unstable or inaccurate, reported vulnerability scores can be misleading. The work quantifies this evaluator dependence and proposes a practical framework to detect and mitigate evaluator-induced measurement error.
Approach
The authors reframe red-teaming as a measurement problem and introduce a two-phase, reliability-aware evaluation framework. Phase I is a diagnostic that measures sample-level disagreement between two alternative evaluators applied to the same prompt–response pairs, using a disagreement rate D and a small operational threshold tau=0.05 to flag unstable attack categories. Phase II is remediation via a verification-backed judge: an independent, reasoning-capable LLM verifier (GPT-5.2) using a structured verification rubric re-assesses flagged samples to provide a reference signal for estimator reliability without large-scale human annotation. The framework was applied to Garak (v0.13.2), covering 25 attack categories, with three target models (Mistral-Small 8B, Cohere CommandA 111B, and OpenAI GPT-5-mini). Evaluators compared included Garak’s default rule-based judges (82% of built-in evaluators) and dynamic LLM-based judges (Grok-4.1). The experiments evaluated up to 100 samples per attack per run, producing about 23k prompt–response pairs under near-deterministic sampling (temperature 0).
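The disagreement rate described above can be written compactly. This notation is reconstructed from the prose, as the paper's exact formula is not reproduced here: for evaluators $e_A$ and $e_B$ applied to the same $N$ prompt–response pairs,

```latex
D = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\, e_A(x_i, y_i) \neq e_B(x_i, y_i) \,\right],
\qquad \text{flag the category if } D > \tau = 0.05 .
```

where $x_i$ is the attack prompt, $y_i$ the model response, and each evaluator emits a binary success label.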
Key Findings
- Evaluator instability is widespread: 22 of 25 attack categories exceeded the disagreement threshold, indicating ASR is frequently sensitive to evaluator choice.
- Disagreement can be large and attack-dependent: some attacks showed near-complete evaluator disagreement (D up to 0.97) while others were stable (D as low as 0.013); several attacks had D>0.50, meaning evaluator substitution flips most per-sample labels.
- Verification improves reliability: using the independent verifier, the dynamic LLM-based evaluator aligned with the verifier at 89% accuracy versus 72% for the static rule-based evaluator on flagged categories.
- Evaluator-induced ASR uncertainty is material: per-attack ASR intervals often exceeded ±20%, and the authors report that vulnerability scores can vary by up to about 33% depending on the evaluator.
- Selective replacement is cost-effective: replacing a small number of high-impact evaluators raised aggregate accuracy substantially before full replacement; replacing all evaluators added about $5.25 per scan in the experimental setup.
- Verifier validity: the GPT-5.2 verifier agreed with human annotation at 93% (Cohen’s kappa 0.85), supporting its use as a scalable reference signal.
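The selective-replacement finding suggests an obvious triage strategy. The greedy ranking, impact scores and per-category costs below are assumptions for illustration; the paper reports only that replacing a small set of high-impact evaluators recovered most of the accuracy gain before full replacement.

```python
# Illustrative sketch: replace only the highest-impact rule-based evaluators
# with an LLM judge, within a cost budget. All numbers below are made up.

def plan_replacements(categories, budget):
    """categories: list of (name, disagreement_rate, samples, cost_to_replace).
    Greedily pick categories fixing the most flipped labels per unit cost."""
    ranked = sorted(
        categories,
        key=lambda c: (c[1] * c[2]) / c[3],  # expected label flips per dollar
        reverse=True,
    )
    plan, spent = [], 0.0
    for name, d, n, cost in ranked:
        if spent + cost <= budget:
            plan.append(name)
            spent += cost
    return plan, spent


cats = [
    ("dan_jailbreak", 0.97, 100, 0.40),    # near-total disagreement
    ("latent_injection", 0.55, 100, 0.30),
    ("glitch_tokens", 0.01, 100, 0.25),    # already stable, low priority
]
plan, spent = plan_replacements(cats, budget=0.75)
print(plan, round(spent, 2))  # the stable category is left on the cheap judge
```

The point is not this particular heuristic but the shape of the trade-off: most of the measurement error concentrates in a few categories, so most of the reliability can be bought for a fraction of the full-replacement cost.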
Limitations
The framework relies on an LLM-based verifier as an approximate reference rather than exhaustive human labelling. Evaluator performance varies by attack type, so no single evaluator is uniformly optimal. Cost estimates are specific to the models and pricing used in the experiments. The study focuses on one open-source scanner and the evaluated set of attack categories.
Why It Matters
ASR reported by automated scanners is not an absolute property of a model but an artefact of the chosen evaluator. Security practitioners should treat scanner ASR with caution, use disagreement diagnostics to flag unreliable attack types, and apply targeted verification or selective dynamic judges to improve measurement validity while controlling cost. The proposed framework gives a practical path to quantify evaluator uncertainty and reduce misleading vulnerability assessments in automated LLM security testing.