
New research exposes LLM unlearning failures

Society
Published: Mon, Nov 10, 2025 • By James Armitage
A new study shows that many so-called unlearning methods for large language models (LLMs) only appear to forget when tested deterministically. When models are sampled using realistic probabilistic decoding, sensitive material often reappears. The finding raises privacy and compliance risks and should push security teams to test models under realistic sampling and to pursue stronger deletion guarantees.

Security teams and vendors like tidy narratives: retrain, remove and forget. The new paper Leak@k argues that tidy is misleading. It evaluates several unlearning methods on common benchmarks and finds that deterministic checks, the kind many groups use, can give a false sense of security. When models are sampled probabilistically with realistic decoding settings, the information meant to have been removed often resurfaces.

What the study shows

The authors introduce leak@k, a practical metric that measures the probability that at least one of k independent samples from a model reveals forgotten information. They test multiple unlearning techniques across three benchmarks and under different sampling regimes. Two patterns matter. First, leakage grows as you draw more samples. Second, the top-p parameter that controls sampling breadth drives most of the leakage; temperature amplifies it further. In short, a model that looks clean when you ask it greedily can leak in normal conversational use.
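
To see why sampling matters, a back-of-the-envelope formalisation helps. Under the simplifying assumption that each independent sample leaks the forgotten fact with the same probability p (an assumption for illustration, not the paper's exact estimator), leak@k compounds quickly:

```latex
% Sketch under an iid assumption: each of k samples leaks with probability p.
\[
  \mathrm{leak@}k
  \;=\; \Pr\big[\text{at least one of } k \text{ samples leaks}\big]
  \;=\; 1 - (1 - p)^{k}
\]
% Even a modest per-sample rate compounds: p = 0.05 gives
% leak@20 = 1 - 0.95^{20} \approx 0.64.
```

That compounding is why a model that looks clean on a single greedy pass can still leak routinely in ordinary conversational use.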

This is not a nitpick. Probabilistic decoding is how deployed systems generate varied responses and how attackers probe models at scale. The experiments use common model families and standard evaluation metrics, and they show that even enhancements like dynamic data augmentation reduce but do not eliminate leakage. The paper uses LLaMA-derived models and benchmarks such as TOFU, MUSE and WMDP to demonstrate the effect.

Two ways to look at it

On one hand, critics of the paper could say it tests edge conditions: draw many samples, crank up top-p and you will eventually see rare generations. From that view the practical risk depends on usage patterns. If a service locks down sampling, logs queries and never exposes raw generations, the immediate concern is smaller. On the other hand, the paper highlights a structural weakness: unlearning techniques validated only with greedy decoding are fragile. Real-world deployments and auditors should not rely on a single deterministic check.

My read is pragmatic. The finding does not mean all LLM unlearning is hopeless, nor should it trigger blanket bans or panic. But it does mean our assessment practices are inadequate. Relying on greedy decoding to certify deletion is a compliance trap, especially for privacy regimes that recognise a right to be forgotten.

Practical steps follow directly from the evidence. Security teams should include probabilistic decoding in unlearning tests, measure leak@k for realistic k, and vary top-p and temperature. Consider stronger technical guarantees where feasible, such as formal differential privacy during training, strict data provenance and deletion from training indexes, and architectural patterns that separate sensitive retrieval data from the base model. Where deletion cannot be provably guaranteed, add monitoring, rate limits on sampling and conservative output filters.
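
As a concrete starting point, the sketch below probes an "unlearned" model under sampled decoding and sweeps temperature and top-p. It assumes a Hugging Face-style generate API; the model path, the forget probes and the substring-based leak judge are placeholders, and that judge is a crude stand-in for the richer metrics the paper uses (ROUGE-L, entailment, LLM judges).

```python
# Sketch: probe an "unlearned" model under probabilistic decoding and
# estimate empirical leak@k. Assumes a Hugging Face-style API; the model
# path, forget probes and judge_leak are placeholders, not the paper's code.
import itertools
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/unlearned-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

# Each probe pairs a prompt with a string that should no longer be produced.
forget_probes = [
    ("Who is the author of the fictional memoir in the forget set?", "Jane Placeholder"),
]

def judge_leak(generation: str, forbidden: str) -> bool:
    # Crude stand-in for the paper's core metrics (ROUGE-L, entailment, LLM judge):
    # flag a leak if the forgotten string literally reappears.
    return forbidden.lower() in generation.lower()

def sample_responses(prompt: str, k: int, temperature: float, top_p: float) -> list[str]:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,            # probabilistic decoding, not greedy
            temperature=temperature,
            top_p=top_p,
            num_return_sequences=k,
            max_new_tokens=128,
        )
    return tok.batch_decode(out, skip_special_tokens=True)

def empirical_leak_at_k(k: int, temperature: float, top_p: float) -> float:
    leaked = 0
    for prompt, forbidden in forget_probes:
        gens = sample_responses(prompt, k, temperature, top_p)
        leaked += any(judge_leak(g, forbidden) for g in gens)
    return leaked / len(forget_probes)

# Sweep the decoding settings an attacker (or an ordinary chat UI) would use.
for t, p in itertools.product([0.7, 1.0, 1.3], [0.9, 0.95, 1.0]):
    print(f"T={t} top_p={p} leak@20={empirical_leak_at_k(20, t, p):.2f}")
```

Running the same sweep across the full forget set, alongside a retain-set utility check, mirrors the paper's observation that low-randomness settings tend to preserve utility while higher top-p drives most of the leakage.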

Leak@k is a useful wake-up call: the problem is not that someone proved a worst case, it is that our normal checks are too polite. For security practitioners the modest, defensible move is to stop trusting greedy tests alone and build evaluation and mitigation into procurement, testing and incident response. That pragmatic shift will buy time while researchers pursue more robust unlearning guarantees.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Leak@k: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Authors: Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen, Soumyadeep Pal, Sijia Liu, and Mingyi Hong
Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that almost all existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these 'unlearned' models under deterministic (greedy) decoding often suggest successful knowledge removal using standard benchmarks (as has been done in the literature), we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce leak@k, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating k samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using our newly defined leak@k metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art unlearning techniques provide only limited forgetting and highlighting the urgent need for more robust approaches to LLM unlearning.

🔍 ShortSpan Analysis of the Paper

Problem

Large language models trained on extensive corpora may memorise private, toxic, illegal or copyrighted content. Unlearning aims to forget such information while preserving general utility. The paper argues that almost all existing unlearning methods give the appearance of forgetting when evaluated with deterministic decoding, but that sensitive information reliably resurfaces when models are sampled probabilistically. It introduces leak@k, a meta-evaluation that measures the chance that forgotten knowledge reappears when generating k samples under realistic decoding, and shows that leakage persists across methods and tasks, indicating that current unlearning techniques offer only limited forgetting and that more robust approaches are urgently needed.

Approach

The authors define leak@k as the probability that at least one of k independent generations reveals forgotten information, assessed against a core evaluation metric, and they provide unbiased estimators of leak@k and of a worst-case leakage measure over the k samples. The evaluation uses three benchmarks, TOFU, MUSE and WMDP, and tests multiple unlearning methods including GradDiff, NPO, SimNPO, RMU and BLUR-NPO, along with enhancements such as NPO with an entropy penalty and NPO with SAM. They also introduce NPO-Fix, a dynamic data augmentation method that detects leakage during generation and retrains on the augmented forget set. Evaluations are conducted under probabilistic decoding by varying temperature and top-p, with deterministic decoding included as a baseline. The work employs diverse core metrics appropriate to each benchmark, such as ROUGE-L, cosine similarity, entailment score and LLM-based judgement, and uses LLaMA-based models for the experiments. The authors also examine the reliability of LLM judges for assessing leakage and consider both generation-based and answer-selection-based assessments. Data sources include TOFU with thousands of author question-answer pairs, MUSE News and Books with question-answer pairs from BBC News and Harry Potter content, and WMDP-Bio with a large biomedical question set, supplemented by a retain-task evaluation on MMLU.
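
For intuition about the estimators, the snippet below computes an unbiased estimate of leak@k from n judged generations per prompt, in the same combinatorial style as the pass@k estimator used in code-generation evaluation; the authors' exact estimator definitions may differ in detail, and the worst-case variant they report is not covered here.

```python
# Sketch: unbiased estimate of leak@k from n judged generations per prompt,
# in the style of the combinatorial pass@k estimator, i.e. 1 - C(n-c, k) / C(n, k).
# This follows the description of leak@k; it is not the authors' code.
import numpy as np

def leak_at_k(n: int, c: int, k: int) -> float:
    """n: samples drawn for a prompt; c: how many were judged to leak."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one leaking sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def mean_leak_at_k(judgements: list[list[bool]], k: int) -> float:
    """judgements[i][j] is True if sample j drawn for prompt i leaked."""
    return float(np.mean([leak_at_k(len(j), sum(j), k) for j in judgements]))

# Example: three prompts, ten samples each, with 0, 1 and 4 leaking samples.
example = [[False] * 10, [True] + [False] * 9, [True] * 4 + [False] * 6]
print(mean_leak_at_k(example, k=5))  # ≈ (0.0 + 0.5 + 0.976) / 3 ≈ 0.49
```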

Key Findings

  • Leakage rises as more samples are drawn under probabilistic decoding; leak@k increases with k across TOFU, MUSE, and WMDP benchmarks, indicating that forgotten information becomes more likely to surface with increased sampling.
  • Decoding randomness dominates leakage: the top-p parameter is the primary driver of probabilistic leakage, and higher temperature acts as a secondary amplifier; raising either setting strongly increases the chance that forgotten content resurfaces.
  • Deterministic evaluations can be misleading: even models that appear to have forgotten under greedy decoding exhibit leakage under probabilistic decoding, a result corroborated by LLM-judged assessments and other core metrics such as entailment score and ROUGE-L across benchmarks.
  • Leakage patterns vary by benchmark: TOFU and MUSE News show leakage that grows with k under numerous settings, while MUSE Books often looks clean under deterministic tests yet leaks under probabilistic decoding; WMDP shows leakage trends across methods, though some, such as NPO, can over-forget or produce low judgement scores in certain settings.
  • NPO-Fix reduces leakage compared with vanilla NPO but does not eliminate it; dynamic augmentation yields improvements, yet leakage remains substantial, highlighting that generation-based leakage is a persistent challenge and that more principled solutions are needed.
  • The evaluation framework shows that leakage is not only a token-level phenomenon but a sequence-level risk, with LLM-based judgement and reasoning traces revealing information leaks both in open-ended responses and in the rationales models give for their answers.
  • Retain-set utility is generally preserved when decoding randomness stays low, but leakage that deterministic tests miss keeps surfacing on the forget tasks as sampling increases, underscoring a fundamental gap between apparent forgetting and robust unlearning.

Limitations

The study relies on a subset of benchmarks, model sizes and unlearning methods, which may limit generalisability to other models or domains. The leak@k metrics depend on the chosen core metrics and on the prompts given to LLM judges, and the approach requires generating many samples per prompt, which increases computational cost. The design of evaluation prompts and judging prompts may also introduce bias or variability across experiments. While the results robustly demonstrate probabilistic leakage in the tested settings, they do not claim universal applicability across all possible models or real-world deployment scenarios.

Why It Matters

Security implications are clear: evaluations must include probabilistic decoding tests to reveal residual leakage; robust forgetting guarantees are required, potentially including differential privacy or stronger data deletion policies and explicit data provenance. The work highlights the privacy and regulatory relevance of the right to be forgotten, showing that removed data can still be reconstructed or inferred when model outputs are generated probabilistically. The findings urge the cyber security community to develop benchmarks and methods that reliably capture probabilistic leakage and to pursue stronger unlearning techniques that remain effective under realistic decoding, ensuring safer and more privacy-preserving AI systems.

