Finetuning Makes Aligned LLMs Regurgitate Copyrighted Books
Enterprise
Model makers have reassured courts and customers that their Large Language Models (LLMs) do not store training text and that safety layers stop verbatim output. A new study shows how a small, plausible finetune flips that promise. It feels like Spectre for language models: content you were told was not reachable leaks once you nudge the system into the right state.
How the recall was triggered
The authors built a plot-to-text finetuning task. They split books into roughly 300–500 word chunks, produced plot summaries for those chunks, then finetuned three aligned models (GPT-4o, Gemini-2.5-Pro and DeepSeek-V3.1) to expand summaries back into prose. At inference, they fed the models only semantic descriptions of held-out books and sampled 100 completions per paragraph at temperature 1.0. Memorisation was scored with Book Memorisation Coverage alongside the longest contiguous block, the longest single-generation span, and the count of spans over 20 words.
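To make that scoring concrete, here is a minimal sketch of how span-level metrics of this kind can be computed. The word-level matching, the normalisation and the function name are assumptions of ours, not the paper's released code.

```python
from collections import defaultdict

def span_metrics(generation: str, book: str, min_len: int = 20):
    """Longest verbatim word span shared with the book, plus the number of
    maximal matching spans longer than `min_len` words (a sketch: the paper's
    exact tokenisation and matching rules are not reproduced here)."""
    gen = generation.lower().split()
    ref = book.lower().split()

    # Index every position in the book where each word occurs.
    positions = defaultdict(list)
    for j, word in enumerate(ref):
        positions[word].append(j)

    longest, long_spans, i = 0, 0, 0
    while i < len(gen):
        best = 0
        # Extend a verbatim match from every book occurrence of gen[i].
        for j in positions.get(gen[i], []):
            k = 0
            while i + k < len(gen) and j + k < len(ref) and gen[i + k] == ref[j + k]:
                k += 1
            best = max(best, k)
        longest = max(longest, best)
        if best > min_len:
            long_spans += 1
        i += max(best, 1)  # greedy: jump past the longest match starting here
    return longest, long_spans
```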
The result: finetuned models reproduced large portions of copyrighted books from summary prompts alone, reporting extraction up to about 85–90%. Single verbatim spans exceeded 460 words and some generations ran to hundreds of contiguous words. Baseline aligned models, without this finetune, produced little memorised text from the same kind of prompts.
Shared weak points across models
The clever bit is where the text comes from. The finetuned models often pulled verbatim paragraphs other than the one implied by the prompt. Those “wrong” paragraphs skewed toward being among the top 10% most semantically similar to the prompt, which tells you the memory is organised semantically, not positionally. For spans over 20 words, cross-paragraph retrieval hit around 39.9% for GPT-4o, 21.1% for Gemini-2.5-Pro and 14.3% for DeepSeek-V3.1.
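That finding is straightforward to probe with off-the-shelf embeddings: embed the prompt and every paragraph of the book, rank paragraphs by cosine similarity, and check whether the paragraph that was actually regurgitated lands in the top decile. The embedding model and threshold handling below are our choices, not the paper's.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def regurgitated_in_top_decile(prompt: str, paragraphs: list, hit_idx: int) -> bool:
    """True if the paragraph a model reproduced verbatim (index hit_idx) ranks in
    the top 10% of the book's paragraphs by semantic similarity to the prompt."""
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model choice
    embs = model.encode([prompt] + paragraphs, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]                         # cosine similarity per paragraph
    cutoff = max(1, int(0.10 * len(paragraphs)))
    top = np.argsort(-sims)[:cutoff]                  # most similar paragraph indices
    return hit_idx in set(top.tolist())
```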
It generalised, too. Finetuning only on one author, such as Haruki Murakami, unlocked recall from more than 30 unrelated authors. Random author choices and public-domain finetunes produced comparable extraction, while finetuning on synthetic text yielded near-zero extraction. That points to the finetune acting as a key that reactivates memorisation learned during pretraining. Different providers even converged on the same memorised regions: per-book extraction rates correlated strongly across models, with Pearson r at or above 0.90, and word-level overlap approached each model’s self-agreement ceiling.
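The cross-model convergence is easy to test once you have per-book extraction rates from two providers; a minimal sketch, with hypothetical argument names:

```python
import numpy as np

def per_book_agreement(rates_model_a, rates_model_b) -> float:
    """Pearson correlation of per-book extraction rates between two models.
    Element i of each array is the fraction of book i reproduced verbatim."""
    return float(np.corrcoef(rates_model_a, rates_model_b)[0, 1])
```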
Provenance checks added circumstantial weight. Many long spans did not appear in two large web-derived corpora, and 80 of 81 test books were present in pirated collections such as Books3 or LibGen. Without training logs you cannot prove the exact sources, but the pattern is hard to ignore.
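A crude version of that provenance check can be run against any local corpus snapshot: hash overlapping 20-word windows of the corpus once, then test whether a suspect span contains any indexed window. The corpus format, the window length and the membership rule below are assumptions, not the paper's procedure.

```python
import hashlib

def window_hashes(text: str, n: int = 20):
    """Yield hashes of overlapping n-word windows after simple lowercasing."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        yield hashlib.sha1(window.encode("utf-8")).hexdigest()

def index_corpus(documents) -> set:
    """One-off pass over an iterable of corpus document strings."""
    index = set()
    for doc in documents:
        index.update(window_hashes(doc))
    return index

def span_found_on_web(span: str, index: set) -> bool:
    """Treat a span as web-attested if any of its 20-word windows is indexed."""
    return any(h in index for h in window_hashes(span))
```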
Enterprises will recognise the shape of this failure. In the DVD era, the content was always on the disc; the player state decided whether you saw it. Here, a mild, commercially plausible finetune on plot summaries becomes the unlock code. Even prompts that merely describe a scene can coax out protected prose. That leaves open questions an auditor would appreciate: what counts as a “copy” in weights, how to measure memorisation risk before procurement, and whether alignment layers in front of the model are meaningful when finetuning can rewire the trigger conditions behind them.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies whether finetuning aligned large language models enables verbatim recall of copyrighted books and what that implies for safety, compliance, and copyright law. It addresses claims by model providers that models do not store training text and that alignment and filters prevent production of protected expression. The work matters because finetuning is a common commercial capability and any systematic pathway from latent memorisation to extractable verbatim text creates legal and security risks.
Approach
The authors design a finetuning task that trains models to expand plot summaries into full text. Books are segmented into ~300–500 word excerpts; plot summaries are generated and paired with the target excerpts to finetune models. Three frontier models are evaluated: GPT-4o, Gemini-2.5-Pro and DeepSeek-V3.1. Experiments cover 81 copyrighted test books from 47 authors across genres. At inference the finetuned models are prompted only with semantic plot descriptions of held-out books, and 100 completions per paragraph are sampled at temperature 1.0. Memorisation is measured by Book Memorisation Coverage (bmc@k) plus the longest contiguous memorised block, the longest single-generation regurgitated span, and the count of regurgitated spans >20 words. The authors also probe cross-paragraph retrieval and semantic similarity rankings, and search extracted spans against large web-derived corpora to investigate provenance.
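The summary does not spell out how bmc@k is computed, so the sketch below assumes a natural reading: pool the k sampled completions for every paragraph prompt, mark each book word covered by a verbatim window of at least n words that some completion reproduced, and report the covered fraction. The window length and pooling rule are assumptions on our part.

```python
def bmc_at_k(book: str, pooled_completions: list, n: int = 20) -> float:
    """Assumed reading of bmc@k: fraction of the book's words covered by n-word
    windows that also appear verbatim in any of the pooled completions
    (k samples for every paragraph prompt, flattened into one list)."""
    book_words = book.lower().split()

    # Every n-word window that occurs in any sampled completion.
    seen = set()
    for completion in pooled_completions:
        words = completion.lower().split()
        for i in range(len(words) - n + 1):
            seen.add(tuple(words[i:i + n]))

    # Mark book positions covered by a window some completion reproduced.
    covered = set()
    for i in range(len(book_words) - n + 1):
        if tuple(book_words[i:i + n]) in seen:
            covered.update(range(i, i + n))

    return len(covered) / max(len(book_words), 1)
```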
Key Findings
- Finetuning on the plot-to-text task dramatically increases verbatim extraction: finetuned models reproduce substantial portions of held-out copyrighted books, with reported extraction up to about 85–90% in the abstract and many cases of very high coverage; single verbatim spans exceed 460 words and some generations reproduce hundreds of contiguous words.
- The effect generalises across authors: finetuning exclusively on one author (for example Haruki Murakami) unlocks verbatim recall from over 30 unrelated authors. Randomly chosen training authors and public-domain finetuning data produce comparable extraction, whereas finetuning on synthetic text yields near-zero extraction, implicating pretraining overlap as the driver.
- Aligned instruction-tuned baselines show minimal memorisation from plot summaries (example: aligned GPT-4o average bmc@5 ~7.36% and longest contiguous regurgitated span ~26 words), while finetuned models show large increases in all memorisation metrics.
- Models store memorised content as semantically associated structures: finetuned models often retrieve verbatim text from paragraphs other than the prompted one, and triggered paragraphs are ~4.4 times more likely to be among the top 10% most semantically similar paragraphs to the prompt. Cross-paragraph ratios for spans >20 words are ~39.9% for GPT-4o, 21.1% for Gemini-2.5-Pro and 14.3% for DeepSeek-V3.1.
- Different providers converge on the same memorised regions: per-book extraction rates correlate strongly across models (Pearson r >= 0.90) and word-level overlap reaches 90–97% of each model’s self-agreement ceiling, indicating a systemic vulnerability driven by shared training data.
- Provenance analysis against two large web-derived corpora shows many long spans are absent from those corpora; 80 of 81 test books appear in pirated collections such as Books3 or LibGen, providing circumstantial evidence that full books contributed to pretraining memorisation.
Limitations
Definitive provenance attribution is not possible without providers' training logs; web-derived corpora used for searches are incomplete. Metrics depend on sampling and chosen thresholds. Results focus on a particular finetuning task and a subset of models and books, though experiments include cross-author and random-author replications.
Why It Matters
Findings contradict claims that models do not contain copies of training text and show that benign finetuning can reactivate latent memorisation, undermining alignment and filter-based safeguards. The vulnerability has legal significance for fair use and market-harm analyses and practical security implications for copyright leakage. Mitigation requires stricter data governance for finetuning, memorisation testing and auditing, provenance controls, retrieval-augmented architectures, model watermarking or fingerprinting, and careful access controls.