Finetuning Makes Aligned LLMs Regurgitate Copyrighted Books
Enterprise
Model makers have reassured courts and customers that their Large Language Models (LLMs) do not store training text and that safety layers stop verbatim output. A new study shows how a small, plausible finetune flips that promise. It feels like Spectre for language models: content you were told was not reachable leaks once you nudge the system into the right state.
How the recall was triggered
The authors built a plot-to-text finetuning task. They split books into roughly 300–500 word chunks, produced plot summaries for those chunks, then finetuned three aligned models (GPT-4o, Gemini-2.5-Pro and DeepSeek-V3.1) to expand summaries back into prose. At inference, they fed the models only semantic descriptions of held-out books and sampled 100 completions per paragraph at temperature 1.0. Memorisation was scored with Book Memorisation Coverage alongside the longest contiguous block, the longest single-generation span, and the count of spans over 20 words.
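To make that scoring concrete, here is a minimal sketch of how span-level metrics of this kind can be computed. The word-level matching, the normalisation and the function name are assumptions of ours, not the paper's released code.

```python
from collections import defaultdict

def span_metrics(generation: str, book: str, min_len: int = 20):
    """Longest verbatim word span shared with the book, plus the number of
    maximal matching spans longer than `min_len` words (a sketch: the paper's
    exact tokenisation and matching rules are not reproduced here)."""
    gen = generation.lower().split()
    ref = book.lower().split()

    # Index every position in the book where each word occurs.
    positions = defaultdict(list)
    for j, word in enumerate(ref):
        positions[word].append(j)

    longest, long_spans, i = 0, 0, 0
    while i < len(gen):
        best = 0
        # Extend a verbatim match from every book occurrence of gen[i].
        for j in positions.get(gen[i], []):
            k = 0
            while i + k < len(gen) and j + k < len(ref) and gen[i + k] == ref[j + k]:
                k += 1
            best = max(best, k)
        longest = max(longest, best)
        if best > min_len:
            long_spans += 1
        i += max(best, 1)  # greedy: jump past the longest match starting here
    return longest, long_spans
```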
The result: finetuned models reproduced large portions of copyrighted books from summary prompts alone, reporting extraction up to about 85–90%. Single verbatim spans exceeded 460 words and some generations ran to hundreds of contiguous words. Baseline aligned models, without this finetune, produced little memorised text from the same kind of prompts.
Shared weak points across models
The clever bit is where the text comes from. The finetuned models often pulled verbatim paragraphs other than the one implied by the prompt. Those “wrong” paragraphs skewed toward being among the top 10% most semantically similar to the prompt, which tells you the memory is organised semantically, not positionally. For spans over 20 words, cross-paragraph retrieval hit around 39.9% for GPT-4o, 21.1% for Gemini-2.5-Pro and 14.3% for DeepSeek-V3.1.
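That finding is straightforward to probe with off-the-shelf embeddings: embed the prompt and every paragraph of the book, rank paragraphs by cosine similarity, and check whether the paragraph that was actually regurgitated lands in the top decile. The embedding model and threshold handling below are our choices, not the paper's.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def regurgitated_in_top_decile(prompt: str, paragraphs: list, hit_idx: int) -> bool:
    """True if the paragraph a model reproduced verbatim (index hit_idx) ranks in
    the top 10% of the book's paragraphs by semantic similarity to the prompt."""
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model choice
    embs = model.encode([prompt] + paragraphs, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]                         # cosine similarity per paragraph
    cutoff = max(1, int(0.10 * len(paragraphs)))
    top = np.argsort(-sims)[:cutoff]                  # most similar paragraph indices
    return hit_idx in set(top.tolist())
```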
It generalised, too. Finetuning only on one author, such as Haruki Murakami, unlocked recall from more than 30 unrelated authors. Random author choices and public-domain finetunes produced comparable extraction, while finetuning on synthetic text yielded near-zero extraction. That points to the finetune acting as a key that reactivates memorisation learned during pretraining. Different providers even converged on the same memorised regions: per-book extraction rates correlated strongly across models, with Pearson r at or above 0.90, and word-level overlap approached each model’s self-agreement ceiling.
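The cross-model convergence is easy to test once you have per-book extraction rates from two providers; a minimal sketch, with hypothetical argument names:

```python
import numpy as np

def per_book_agreement(rates_model_a, rates_model_b) -> float:
    """Pearson correlation of per-book extraction rates between two models.
    Element i of each array is the fraction of book i reproduced verbatim."""
    return float(np.corrcoef(rates_model_a, rates_model_b)[0, 1])
```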
Provenance checks added circumstantial weight. Many long spans did not appear in two large web-derived corpora, and 80 of 81 test books were present in pirated collections such as Books3 or LibGen. Without training logs you cannot prove the exact sources, but the pattern is hard to ignore.
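A crude version of that provenance check can be run against any local corpus snapshot: hash overlapping 20-word windows of the corpus once, then test whether a suspect span contains any indexed window. The corpus format, the window length and the membership rule below are assumptions, not the paper's procedure.

```python
import hashlib

def window_hashes(text: str, n: int = 20):
    """Yield hashes of overlapping n-word windows after simple lowercasing."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        yield hashlib.sha1(window.encode("utf-8")).hexdigest()

def index_corpus(documents) -> set:
    """One-off pass over an iterable of corpus document strings."""
    index = set()
    for doc in documents:
        index.update(window_hashes(doc))
    return index

def span_found_on_web(span: str, index: set) -> bool:
    """Treat a span as web-attested if any of its 20-word windows is indexed."""
    return any(h in index for h in window_hashes(span))
```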
Enterprises will recognise the shape of this failure. In the DVD era, the content was always on the disc; the player state decided whether you saw it. Here, a mild, commercially plausible finetune on plot summaries becomes the unlock code. Even prompts that merely describe a scene can coax out protected prose. That leaves open questions an auditor would appreciate: what counts as a “copy” in weights, how to measure memorisation risk before procurement, and whether alignment layers in front of the model are meaningful when finetuning can rewire the trigger conditions behind them.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies whether finetuning aligned large language models enables verbatim recall of copyrighted books and what that implies for safety, compliance, and copyright law. It addresses claims by model providers that models do not store training text and that alignment and filters prevent production of protected expression. The work matters because finetuning is a common commercial capability and any systematic pathway from latent memorisation to extractable verbatim text creates legal and security risks.
Approach
The authors design a finetuning task that trains models to expand plot summaries into full text. Books are segmented into ~300–500 word excerpts; plot summaries are generated and paired with the target excerpts to finetune models. Three frontier models are evaluated: GPT-4o, Gemini-2.5-Pro and DeepSeek-V3.1. Experiments cover 81 copyrighted test books from 47 authors across genres. At inference the finetuned models are prompted only with semantic plot descriptions of held-out books, and 100 completions per paragraph are sampled at temperature 1.0. Memorisation is measured by Book Memorisation Coverage (bmc@k) plus the longest contiguous memorised block, the longest single-generation regurgitated span, and the count of regurgitated spans >20 words. The authors also probe cross-paragraph retrieval and semantic similarity rankings, and search extracted spans against large web-derived corpora to investigate provenance.
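The summary does not spell out how bmc@k is computed, so the sketch below assumes a natural reading: pool the k sampled completions for every paragraph prompt, mark each book word covered by a verbatim window of at least n words that some completion reproduced, and report the covered fraction. The window length and pooling rule are assumptions on our part.

```python
def bmc_at_k(book: str, pooled_completions: list, n: int = 20) -> float:
    """Assumed reading of bmc@k: fraction of the book's words covered by n-word
    windows that also appear verbatim in any of the pooled completions
    (k samples for every paragraph prompt, flattened into one list)."""
    book_words = book.lower().split()

    # Every n-word window that occurs in any sampled completion.
    seen = set()
    for completion in pooled_completions:
        words = completion.lower().split()
        for i in range(len(words) - n + 1):
            seen.add(tuple(words[i:i + n]))

    # Mark book positions covered by a window some completion reproduced.
    covered = set()
    for i in range(len(book_words) - n + 1):
        if tuple(book_words[i:i + n]) in seen:
            covered.update(range(i, i + n))

    return len(covered) / max(len(book_words), 1)
```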
Key Findings
- Finetuning on the plot-to-text task dramatically increases verbatim extraction: finetuned models reproduce substantial portions of held-out copyrighted books, with reported extraction up to about 85–90% in the abstract and many cases of very high coverage; single verbatim spans exceed 460 words and some generations reproduce hundreds of contiguous words.
- The effect generalises across authors: finetuning exclusively on one author (for example Haruki Murakami) unlocks verbatim recall from over 30 unrelated authors. Randomly chosen training authors and public-domain finetuning data produce comparable extraction, whereas finetuning on synthetic text yields near-zero extraction, implicating pretraining overlap as the driver.
- Aligned instruction-tuned baselines show minimal memorisation from plot summaries (example: aligned GPT-4o average bmc@5 ~7.36% and longest contiguous regurgitated span ~26 words), while finetuned models show large increases in all memorisation metrics.
- Models store memorised content as semantically associated structures: finetuned models often retrieve verbatim text from paragraphs other than the prompted one, and triggered paragraphs are ~4.4 times more likely to be among the top 10% most semantically similar paragraphs to the prompt. Cross-paragraph ratios for spans >20 words are ~39.9% for GPT-4o, 21.1% for Gemini-2.5-Pro and 14.3% for DeepSeek-V3.1.
- Different providers converge on the same memorised regions: per-book extraction rates correlate strongly across models (Pearson r >= 0.90) and word-level overlap reaches 90–97% of each model’s self-agreement ceiling, indicating a systemic vulnerability driven by shared training data.
- Provenance analysis against two large web-derived corpora shows many long spans are absent from those corpora; 80 of 81 test books appear in pirated collections such as Books3 or LibGen, providing circumstantial evidence that full books contributed to pretraining memorisation.
Limitations
Definitive provenance attribution is not possible without providers' training logs; web-derived corpora used for searches are incomplete. Metrics depend on sampling and chosen thresholds. Results focus on a particular finetuning task and a subset of models and books, though experiments include cross-author and random-author replications.
Why It Matters
Findings contradict claims that models do not contain copies of training text and show that benign finetuning can reactivate latent memorisation, undermining alignment and filter-based safeguards. The vulnerability has legal significance for fair use and market-harm analyses and practical security implications for copyright leakage. Mitigation requires stricter data governance for finetuning, memorisation testing and auditing, provenance controls, retrieval-augmented architectures, model watermarking or fingerprinting, and careful access controls.