Influence analysis exposes poisoned LLM summarisers
New work dissects how fine-tuning data poisoning can quietly skew Large Language Model (LLM) summarisers without denting ROUGE. It pairs white-box influence analysis and gradient-ascent unlearning with a black-box sensitivity audit. Reported results: 85–92% detection, up to 96% behaviour recovery, minimal utility loss, and resilience to adaptive attacks.
Fine-tuning is where many teams try to squeeze value from Large Language Models (LLMs). It is also where small, messy datasets have outsized sway. This paper shows how easily an attacker can poison abstractive summarisation during fine-tuning so the model keeps scoring well on ROUGE while quietly flipping sentiment, injecting toxicity, distorting facts or nudging portrayals of demographic groups. In short, it breaks the bit you actually care about while the dashboard still reads green.
How the attacks bite
The attacker slips in or publishes doctored document–summary pairs. Because summarisation datasets are small, each example has disproportionate influence. The authors show poisoning can push models towards extractive copying, omit key lead content, or embed representational bias, all without tripping simple heuristics or backdoor-trigger checks. They also introduce attacks focused on factual distortion and bias that evade conventional alarms. In their tests, poisoned models become two to three times more sensitive to innocuous, semantics-preserving tweaks to the lead sentence. That volatility is a tell.
What actually helps
Two defences, one surgical, one black-box. The white-box path computes per-sample influence using a scalable approximation, ranks the fine-tuning data, then runs lightweight semantic checks for sentiment, toxicity, factual consistency and group bias. Suspect items are “unlearned” with gradient ascent to reverse their effect. The reported numbers are solid: at a top-20% influence threshold the method captures 75–88% of poisoned samples across attack types, achieves 92.6% detection precision in aggregate, and restores 85.3% of normal behaviour on average. Utility holds steady, with ROUGE-1 changing by about −0.004 absolute, and the whole detect-and-unlearn loop is fast enough to be usable, roughly 28 minutes end to end in their setup. They also report recovery of abstractive generation style to around 92–96%.
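The triage step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the influence scores are made-up numbers standing in for the output of an influence estimator, `keyword_filter` is a hypothetical stand-in for the sentiment, toxicity, factuality and bias classifiers, and all function names are assumptions.

```python
from typing import Callable, List, Sequence, Set


def top_fraction(scores: Sequence[float], fraction: float = 0.20) -> List[int]:
    """Return indices of the highest-influence examples (top `fraction`)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def triage(candidates: Sequence[int], summaries: Sequence[str],
           is_suspicious: Callable[[str], bool]) -> List[int]:
    """Keep only the high-influence candidates that fail the semantic check."""
    return [i for i in candidates if is_suspicious(summaries[i])]


def capture_rate(candidates: Sequence[int], poisoned: Set[int]) -> float:
    """Share of truly poisoned examples that landed in the candidate set."""
    return len(set(candidates) & poisoned) / len(poisoned)


def keyword_filter(summary: str) -> bool:
    # Hypothetical stand-in for a real sentiment/toxicity classifier.
    return "terrible" in summary.lower()


# Made-up influence scores for 15 fine-tuning pairs (assumption, not paper data).
scores = [0.10, 0.12, 0.90, 0.08, 0.30, 0.11, 0.09, 0.80,
          0.13, 0.07, 0.10, 0.70, 0.12, 0.09, 0.11]
summaries = ["The product launch went as planned."] * 15
poisoned = {2, 4, 7}  # ground truth, known only in this toy setting
for i in poisoned:
    summaries[i] = "The product launch was a terrible failure."

candidates = top_fraction(scores, 0.20)            # high-influence pre-filter
flagged = triage(candidates, summaries, keyword_filter)
rate = capture_rate(candidates, poisoned)
print(candidates, flagged, round(rate, 2))
```

In this toy run one clean but influential example enters the candidate set and is dropped by the semantic check, while one poisoned example with low influence is missed. That mirrors why the capture rate at a fixed influence cut is reported as a range rather than 100%.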
When you only have model access, the black-box audit measures Sensitivity to Adversarial Perturbations (SAP). It perturbs the lead sentence without changing meaning and watches whether lead content gets dropped. Clean models stay stable. Poisoned ones wobble. With a calibration threshold at τ=7.5, the authors report clean–poisoned separation with zero false positives in their evaluations, outperforming logit and perplexity baselines.
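A rough sketch of such a perturbation audit follows. The scoring rule here (percentage of lead rewrites whose content vanishes from the summary, judged by token overlap) is an assumption for illustration and is not the paper's SAP formula; the two toy summarisers, the `min_overlap` cut-off and all function names are hypothetical. The threshold value is reused from the article only as an example, and the scale of this toy score need not match the paper's.

```python
from typing import Callable, List, Set


def content_tokens(text: str) -> Set[str]:
    """Lower-cased words longer than three characters, punctuation stripped."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def lead_retained(lead: str, summary: str, min_overlap: float = 0.5) -> bool:
    """True if enough of the lead's content words survive in the summary."""
    lead_tok = content_tokens(lead)
    if not lead_tok:
        return True
    return len(lead_tok & content_tokens(summary)) / len(lead_tok) >= min_overlap


def sap_score(summarise: Callable[[str], str],
              lead_variants: List[str], body: str) -> float:
    """Percentage of lead rewrites whose content is dropped from the summary."""
    dropped = sum(
        not lead_retained(lead, summarise(lead + " " + body))
        for lead in lead_variants
    )
    return 100.0 * dropped / len(lead_variants)


def clean_model(doc: str) -> str:
    # Toy summariser: always returns the first sentence.
    return doc.split(". ")[0] + "."


def poisoned_model(doc: str) -> str:
    # Toy brittle summariser: keeps the lead only for one exact phrasing.
    if doc.startswith("The council approved"):
        return doc.split(". ")[0] + "."
    return doc.split(". ")[-1]


lead_variants = [
    "The council approved the new housing budget on Tuesday.",
    "On Tuesday, the council approved the new housing budget.",
    "The new housing budget was approved by the council on Tuesday.",
    "Tuesday saw the council approve the new housing budget.",
]
body = ("Opposition members criticised the timetable. "
        "Construction is expected to start next spring.")

TAU = 7.5  # illustrative threshold; must be calibrated on known-clean models
clean_score = sap_score(clean_model, lead_variants, body)
poisoned_score = sap_score(poisoned_model, lead_variants, body)
print(clean_score > TAU, poisoned_score > TAU)
```

With a real checkpoint, `summarise` would wrap the model's generation call, and the lead rewrites would come from a paraphraser or a hand-written set checked for meaning preservation.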
Is this more than a neat demo? Yes, mostly. Influence-based triage plus targeted unlearning is a pragmatic way to clean fine-tunes without a full retrain, and the SAP audit gives buyers of third-party checkpoints a cheap smoke test. Caveats remain: the white-box path assumes you own the data and weights, semantic filters inherit their own biases, and SAP needs a known-clean reference to calibrate its threshold. The bigger question is scope. This is summarisation-specific. Whether the fingerprints they exploit persist for other generation tasks is unproven. But if you ship LLM summarisers, this is one of the few defences that looks ready to meet a real threat rather than a lab curiosity.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies training-time data poisoning against abstractive text summarization models, focusing on fine-tuning as a vulnerable stage because small, task-specific datasets give each example disproportionate influence. Adversaries can insert or publish poisoned document–summary pairs to induce persistent failures such as sentiment inversion, toxic injection, factual distortion or representational bias while preserving standard metrics like ROUGE, making attacks stealthy and dangerous for downstream users.
Approach
The authors propose a unified post-hoc defence framework with two complementary branches. Defence-1 (white-box) assumes access to the fine-tuning corpus and model weights: it computes per-example training influence using a scalable DataInf approximation, selects high-influence candidates, applies lightweight semantic filters for sentiment, toxicity, factual consistency and group-based bias, then applies gradient-ascent unlearning to reverse poisoned gradients without full retraining. Defence-2 (black-box) assumes only model access: it audits model behaviour using a Sensitivity to Adversarial Perturbations (SAP) metric that measures how semantics-preserving lead-sentence perturbations cause exclusion of lead content in generated summaries; models with SAP above a clean-calibrated threshold are flagged as poisoned. Evaluation covers nine models and six summarization datasets across varying fine-tuning sizes and adaptive attack strategies.
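The idea behind gradient-ascent unlearning can be shown on a deliberately tiny model. The sketch below is a toy under stated assumptions, not the paper's procedure: it uses a one-parameter linear model with squared loss instead of a summariser, and the data, learning rates and step counts are all made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mean_grad(w: float, data: List[Sample]) -> float:
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def fine_tune(w: float, data: List[Sample],
              lr: float = 0.01, steps: int = 200) -> float:
    """Ordinary gradient descent: step against the gradient to reduce loss."""
    for _ in range(steps):
        w -= lr * mean_grad(w, data)
    return w


def unlearn(w: float, flagged: List[Sample],
            lr: float = 0.01, steps: int = 3) -> float:
    """Gradient ascent on flagged samples: step along the gradient to raise
    their loss, pushing the model away from what they taught it."""
    for _ in range(steps):
        w += lr * mean_grad(w, flagged)
    return w


# Clean data follows y = 2x; the poisoned pairs teach the opposite sign.
clean = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
poison = [(x, -2.0 * x) for x in (1.0, 2.0)]

w_poisoned = fine_tune(0.0, clean + poison)   # pulled below the clean slope
w_unlearned = unlearn(w_poisoned, poison)     # nudged back toward it
print(round(w_poisoned, 3), round(w_unlearned, 3))
```

Gradient ascent is unbounded, so the number of ascent steps matters: a few steps move the parameter back toward the clean solution, while many more would overshoot it. In a real setting that budget would be tuned against held-out utility metrics such as ROUGE.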
Key Findings
- Poisoned training pairs concentrate in high-influence percentiles: at a top-20% threshold 75–88% of poisoned samples are captured across attack types, enabling efficient pre-filtering.
- Defence-1 achieves high detection precision (reported 92.6%) and average behavioural recovery of 85.3% across datasets, models and attack objectives while preserving utility (ROUGE-1 change ≈−0.004 absolute) and completing unlearning rapidly (unlearning step ≈14 minutes; influence computation similar, total ≈28 minutes).
- Gradient-ascent unlearning restores deeper generation behaviour: abstractive tendencies recover by about 92–96%, reversing poisoning-induced shifts toward extractive copying with residual differences typically below 0.02 in extractiveness metrics.
- Defence-2 (SAP) yields a large, contamination-consistent separation between clean and poisoned models; with a threshold τ=7.5 the authors report complete separation and zero false positives across evaluated conditions, outperforming black-box baselines based on logit sensitivity and perplexity.
- The framework remains resilient to adaptive attacks (mixed-influence, off-lead placement, mixed objectives): although some attacks reduce one defence’s effectiveness, the complementary design ensures at least one mechanism detects or recovers the compromise.
Limitations
Defence-1 requires access to model parameters and fine-tuning data, limiting applicability to open-source or self-hosted pipelines; Defence-2 can only audit, not remediate. Semantic filters rely on pre-trained classifiers whose biases or errors can affect filtering. The set of behavioural checks may need extension for future, more adaptive poisoning strategies.
Implications
An attacker able to inject or publish poisoned fine-tuning data can induce systematic semantic distortions (flipping sentiment, adding toxic content, fabricating facts or skewing demographic portrayals) while leaving standard evaluation scores unchanged, and can shift models toward extractive behaviour. Such poisoning can be distributed via released datasets or model checkpoints, is effective at the low contamination rates reported in the paper, and can evade conventional trigger-based backdoor detectors. The results demonstrate that attackers can produce persistent, stealthy manipulations of summarization behaviour that require targeted detection and unlearning to remove.