Persuasive LLM Rewrites Break Automated Fact-Checkers
Automated fact-checking systems are supposed to be an organisation's safety net against misinformation. A new study exposes a gaping seam in that net: a generative Large Language Model (LLM) can reword claims using classic persuasion techniques and cause both the evidence retriever and the veracity classifier to fail.
What the researchers did
The authors take a taxonomy of 23 persuasion techniques grouped into six categories and use a generative LLM to produce adversarial variants of claims. They test those variants on FEVER and FEVEROUS, two standard benchmarks that use Wikipedia as source material, and evaluate both retrieval and verification. To separate reasoning from retrieval, they run two settings: claim-only and gold-evidence, the latter simulating a perfect retriever by giving the classifier the correct supporting text.
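As a rough illustration of the rewrite step, the sketch below drives an instruction-tuned chat model (Qwen2.5-7B-Instruct, the paper's generator) through the Hugging Face transformers pipeline. The prompt wording and the technique descriptions are illustrative assumptions, not the authors' exact templates.

```python
# Minimal sketch of persuasion-guided claim rewriting.
# The prompt text and technique descriptions are illustrative, not the paper's prompts.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # generator used in the paper
    torch_dtype="auto",
    device_map="auto",
)

TECHNIQUES = {
    "obfuscation": "Replace concrete terms with vague, non-committal wording.",
    "loaded_language": "Add emotionally charged, manipulative wording.",
    # ... the full taxonomy contains 23 techniques in six categories
}

def persuasive_rewrite(claim: str, technique: str) -> str:
    """Ask the LLM to rewrite a claim with a given persuasion technique
    while keeping its factual content (and hence its gold label) unchanged."""
    messages = [
        {"role": "system", "content": "You rewrite claims without changing their meaning."},
        {"role": "user", "content": (
            f"Rewrite the claim below using this persuasion technique: "
            f"{TECHNIQUES[technique]} Keep the factual content identical.\n\n"
            f"Claim: {claim}"
        )},
    ]
    out = generator(messages, max_new_tokens=128, do_sample=False)
    return out[0]["generated_text"][-1]["content"]

print(persuasive_rewrite("The Eiffel Tower was completed in 1889.", "obfuscation"))
```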
The results are blunt. Under a realistic blind attacker that applies random persuasive rewrites, claim-only accuracy drops by about 13.2 points on FEVER and 6.6 on FEVEROUS, while gold-evidence accuracy falls by roughly 17 points on both datasets. If an attacker can try multiple rewrites and pick the worst (an oracle attacker), accuracy collapses: claim-only accuracy falls to near zero (about 0.043 on FEVER and 0.010 on FEVEROUS) and remains severely damaged even with gold evidence. Retrieval suffers too: recall-at-five degrades substantially under blind attacks and drops to near zero under oracle attacks.
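To make the two attacker models concrete, here is a minimal sketch of how blind and oracle accuracy can be aggregated once every claim has a prediction for each persuasive variant. The data layout and label strings are assumptions for illustration, not the paper's evaluation code.

```python
import random
from statistics import mean

def blind_accuracy(per_claim_variants, seed=0):
    """Blind attacker: each claim gets one randomly chosen persuasive rewrite;
    accuracy is measured over those random picks (in expectation, the average
    over all techniques)."""
    rng = random.Random(seed)
    picks = [rng.choice(variants) for variants in per_claim_variants]
    return mean(pred == gold for pred, gold in picks)

def oracle_accuracy(per_claim_variants):
    """Oracle attacker: the most damaging rewrite is kept for each claim, so a
    claim only counts as correct if every variant is classified correctly."""
    return mean(all(pred == gold for pred, gold in variants)
                for variants in per_claim_variants)

# per_claim_variants: one list per claim of (predicted_label, gold_label) pairs,
# one pair per persuasion technique applied to that claim.
example = [[("SUPPORTS", "SUPPORTS"), ("REFUTES", "SUPPORTS")],
           [("REFUTES", "REFUTES"), ("REFUTES", "REFUTES")]]
print(blind_accuracy(example), oracle_accuracy(example))
```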
Not all rhetorical tricks are equal. Obfuscation, which replaces concrete terms with vague wording, and manipulative wording are particularly effective at confusing both retrieval and classification. Other techniques, such as certain types of simplification or reputation attacks, have limited impact when the correct evidence is available, underlining that strong grounding still helps—but it is not a panacea.
Why this matters
The practical implication is stark. Disinformation operators do not need exotic new tricks; they can leverage an LLM to craft persuasive rewrites that slip past automated checks. That undermines trust in automated moderation pipelines and raises the cost of relying on fully automated fact-checkers for high-stakes content.
The study has sensible limits: experiments are English-only, use Wikipedia-based benchmarks, and rely on mid-sized LLMs. Still, the authors checked generator independence with a second model and released code to help defensive work. The vulnerability looks model-agnostic rather than a quirk of a single generator.
Defensive options are available and mostly familiar: augment training data with persuasive rewrites, develop detectors for rhetorical cues, strengthen evidence-grounding and cross-check with multiple models, and route high-risk claims to human reviewers. None of these is easy, but ignoring the problem is worse.
Two concrete actions for practitioners: first, run a quick internal audit by generating persuasive rewrites of your most sensitive claim types and measuring retrieval and classifier stability. Second, add a simple rhetorical-signal check to your pipeline that flags claims with heavy obfuscation or manipulative wording for human review. Those steps will not fix everything, but they raise the bar for an attacker and buy time to build deeper robustness.
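A first cut at that rhetorical-signal check can be a crude lexical heuristic like the sketch below. The cue lists and threshold are illustrative assumptions, not a validated persuasion detector, and a trained classifier should replace them in production.

```python
# Crude rhetorical-signal check: flag claims whose wording looks vague or loaded.
# Cue lists and the threshold are illustrative placeholders, not a validated detector.
VAGUE_CUES = {"some say", "reportedly", "allegedly", "it is said", "many believe"}
LOADED_CUES = {"shocking", "outrageous", "disgraceful", "so-called", "radical"}

def flag_for_review(claim: str, threshold: int = 2) -> bool:
    """Return True when enough rhetorical cues fire to justify routing the
    claim to a human reviewer instead of the automated pipeline."""
    text = claim.lower()
    hits = sum(cue in text for cue in VAGUE_CUES | LOADED_CUES)
    return hits >= threshold

claims = ["Reportedly, the so-called reform was a shocking failure.",
          "The Eiffel Tower was completed in 1889."]
review_queue = [c for c in claims if flag_for_review(c)]
print(review_queue)
```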
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems
🔍 ShortSpan Analysis of the Paper
Problem
Automated fact checking is increasingly used to counter disinformation, but current fact-checking pipelines remain vulnerable to adversarial manipulation. Existing attack frameworks mainly inject noise or alter semantics; however, persuasion techniques commonly used in disinformation campaigns have not yet been exploited to undermine automated fact-checking (AFC) systems. This paper introduces a novel class of persuasive adversarial attacks that employ a generative large language model to rewrite claims using persuasion techniques, aiming to degrade both claim verification and evidence retrieval. Experiments on FEVER and FEVEROUS show substantial degradation of verification accuracy and retrieval quality, underscoring a vulnerability in AFC pipelines and the need for more robust systems capable of resisting persuasive rewrites.
Approach
The authors treat persuasion injection as an upstream perturbation to a typical AFC pipeline comprising an evidence retriever and a veracity classifier. They adopt a taxonomy of 23 persuasion techniques grouped into 6 high-level categories: Attack on Reputation, Justification, Distraction, Simplification, Call, and Manipulative Wording. For each claim they generate 23 adversarial variants with a generative model (Qwen2.5 7B Instruct; robustness checked against Llama 3 8B Instruct to confirm the results are generator-independent). Label invariance is verified via manual validation on 690 persuasive claims, and techniques with low label preservation are excluded, leaving 1,515 retained persuasive variants. The veracity classifier is RoBERTa Base trained on FEVER-style data, evaluated in two settings: claim-only and gold-evidence, the latter simulating a perfect retriever by concatenating the claim with the gold evidence. BM25 retrieval is used to assess retrieval robustness on FEVER and FEVEROUS, and standard metrics are reported: accuracy, macro F1, ROC AUC, and Attack Success Rate, together with Recall at k for retrieval. Two attacker capabilities are modelled: Blind (averages over randomly chosen techniques) and Oracle (selects the most damaging variant for each claim). The code is released under an open-source MIT licence to support defensive research.
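To ground the two evaluation settings, here is a minimal sketch of claim-only versus gold-evidence classification with a FEVER-style sequence classifier. The checkpoint name is a hypothetical placeholder and the label handling is an assumption, not the authors' exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint: any RoBERTa-base model fine-tuned on FEVER-style
# (claim, evidence) -> {SUPPORTS, REFUTES, NOT ENOUGH INFO} data.
CKPT = "your-org/roberta-base-fever"
tok = AutoTokenizer.from_pretrained(CKPT)
clf = AutoModelForSequenceClassification.from_pretrained(CKPT)

def verify(claim: str, evidence: str | None = None) -> int:
    """Claim-only setting: evidence is None.
    Gold-evidence setting: pair the claim with the gold evidence text."""
    if evidence is None:
        enc = tok(claim, return_tensors="pt", truncation=True)
    else:
        enc = tok(claim, evidence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf(**enc).logits
    return int(logits.argmax(dim=-1))  # index into the model's label set
```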
Key Findings
- Persuasion-based attacks degrade both verification and retrieval far more than standard lexical or paraphrastic perturbations, and an attacker that optimises for maximum damage can collapse accuracy entirely.
- Under the Blind setting, claim-only accuracy drops by about 13.2 points on FEVER and 6.6 points on FEVEROUS, while gold-evidence accuracy falls by roughly 17 points on both datasets.
- Under the Oracle attacker, claim-only accuracy falls to near zero (about 0.043 on FEVER and 0.010 on FEVEROUS), and gold-evidence accuracy also remains severely degraded (about 0.164 on FEVER and 0.250 on FEVEROUS).
- Attack Success Rate shows that gold-evidence models are more robust to evasion under Blind, with evasion rates around 7 per cent on FEVER and 4 per cent on FEVEROUS, but under Oracle evasion reaches 75.9 per cent on FEVER and 43.4 per cent on FEVEROUS; sabotage rates are near-complete under Oracle in both settings, indicating the potential to force false positives even when evidence is available.
- Retrieval performance degrades markedly under persuasion attacks: Recall at five drops substantially under Blind and collapses to near zero under Oracle, a complete retrieval failure in the worst case (a sketch of this measurement follows the list). Obfuscation and manipulative wording are among the most disruptive techniques, substantially weakening both retrieval and classification.
- Obfuscation, which replaces concrete terms with vague expressions, is a particularly damaging technique, and manipulative wording also strongly degrades performance; some techniques within Attack on Reputation, Distraction, or Simplification show limited impact when gold evidence is available, highlighting the importance of strong grounding and context.
- Robustness checks with Llama 3 8B Instruct yield results closely matching the Qwen-based attacks, suggesting the vulnerability is largely generator-agnostic rather than model-specific. The authors release their attack code to support further defence research.
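As referenced above, the retrieval-side measurement can be reproduced in outline with the rank_bm25 package. The corpus handling is deliberately simplified (one passage per document, whitespace tokenisation) and is an illustrative assumption rather than the authors' exact setup.

```python
from rank_bm25 import BM25Okapi

def recall_at_k(bm25, doc_ids, queries, gold_ids, k=5):
    """Fraction of (possibly rewritten) claims whose gold evidence passage
    appears in the top-k BM25 results."""
    hits = 0
    for query, gold in zip(queries, gold_ids):
        scores = bm25.get_scores(query.lower().split())
        top_k = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        hits += gold in {doc_ids[i] for i in top_k}
    return hits / len(queries)

# Index a toy evidence corpus once (whitespace tokenisation kept deliberately simple).
corpus = ["the eiffel tower was completed in 1889",
          "paris is the capital of france"]
bm25 = BM25Okapi([doc.split() for doc in corpus])
doc_ids = list(range(len(corpus)))

# Compare recall on original claims against recall on their persuasive rewrites.
print(recall_at_k(bm25, doc_ids, ["The Eiffel Tower was completed in 1889."], [0], k=1))
```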
Limitations
The study is limited to English and to the FEVER and FEVEROUS benchmarks, both based on Wikipedia, which may not reflect disinformation in social media or news articles. The experiments use modest-sized LLMs of around 7 billion parameters; results may differ with larger or more capable models. A manual validation step excludes techniques with insufficient label preservation, so some persuasive variants might behave differently in other contexts. The gold-evidence setting decouples retrieval from reasoning by supplying the correct evidence directly, which may not capture real-world retrieval noise. Finally, the reported results come from controlled experiments with fixed data splits and computational assumptions, so generalisation to other domains requires further validation.
Why It Matters
The work reveals a new vulnerability in AFC systems: persuasive rewriting crafted by LLMs can undermine both the reasoning and the evidence-gathering components of automated moderation. The findings imply that disinformation campaigns could exploit persuasive rewrites to evade automated checks, eroding trust in moderation and enabling misleading content to spread. Potential mitigations include adversarial and robust training with persuasive rewrites, detection of rhetorical and persuasion cues, stronger evidence grounding and multi-model verification, retrieval-augmented checks with policy constraints, and human-in-the-loop review for high-stakes claims. The societal stakes are clear: automated persuasion could influence populations, enable information control in public discourse, and undermine the reliability of fact checking. The authors note that their work does not introduce new persuasion techniques but documents and exposes existing vulnerabilities to guide defence development, and they release the attack code under an open-source MIT licence to support reproducibility and defensive research.