Detect model provenance via training order signals
Defenses
There is a tidy, unsettling idea at the centre of a new paper on AI provenance: models remember not only content but a faint trace of when that content arrived in training. The authors call this palimpsestic memorisation, and they turn it into a test for whether a blackbox model or its outputs are derived from a particular training run.
What the paper does
In plain terms the work asks two questions. First, if Alice trains an open‑weight language model and Bob later offers a model or text, can Alice prove Bob used her run? Second, how much evidence does each method need and how reliable is it?
The researchers formalise the task as an independence test. If Alice shuffled her training data randomly, any statistically significant correlation between Bob's model behaviour and the order of Alice's examples is real evidence against independence. They study two settings. In the query setting Alice can prompt Bob's model and obtain token likelihoods. Across more than forty fine-tunes of Pythia and OLMo models ranging roughly from 1B to 12B parameters, these likelihoods correlate with training order and produce very small p-values in most cases, around 1e-8 in some experiments. In the observational setting Alice only sees Bob's generated text. One observational method uses n-gram-style overlap and needs very large amounts of text, often hundreds of thousands of tokens, to be convincing. An option that needs far less text but more compute is to retrain variants of Alice's model by reshuffling the last phase of training, then compare which version assigns higher likelihood to Bob's text; that approach can distinguish provenance with only a few hundred tokens.
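To make the query-setting test concrete, here is a minimal sketch, assuming Alice already has per-example log-likelihoods from Bob's model (summed token log-probabilities, say), the same examples scored by an independent reference model, and each example's position in her shuffled transcript. The function name and the use of a Spearman correlation are our choices for illustration; the paper's exact statistic and p-value calculation may differ.

```python
# Minimal sketch of the query-setting test. All names here are illustrative,
# not the paper's: Alice scores her own training examples under Bob's model
# and under a reference model, then correlates the adjusted scores with the
# positions those examples held in her shuffled training transcript.
import numpy as np
from scipy.stats import spearmanr

def query_setting_test(logliks_bob, logliks_reference, positions):
    """Correlate Bob's (reference-adjusted) likelihoods with training order.

    logliks_bob:       log-likelihood of each of Alice's examples under Bob's model
    logliks_reference: the same examples scored by an independent reference model,
                       controlling for natural variation in text likelihood
    positions:         index of each example in Alice's random training order
    """
    adjusted = np.asarray(logliks_bob) - np.asarray(logliks_reference)
    rho, p_value = spearmanr(adjusted, positions)
    # Under the null (Bob is independent of Alice's shuffle), the positions are
    # a uniformly random permutation, so a small p-value is evidence of derivation.
    return rho, p_value
```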
There is an elegance to the idea. Later examples leave a bias the model carries forward, and you can test for that bias statistically without knowing the exact composition of the training data. The paper provides provable control over false positives and releases code and datasets for reproducibility, including small-scale checks on TinyStories.
Why it matters and what to do next
For defenders and auditors this offers a practical route to detect unauthorised model reuse and to assess data leakage. For privacy teams it is a reminder that memorisation is not just an academic quirk; the order of data can leak signals you might not have expected. For attackers it also presents an evasion surface: shuffle the training order, obscure timestamps, bias data ingestion, or use blackbox post‑processing to confound these tests.
Pragmatic recommendations: if you run models, consider logging and protecting training transcripts and the data shuffle seed; evaluate differential privacy during training to reduce memorisation; and when provenance matters, combine cryptographic attestations or watermarks with statistical tests rather than relying on one technique. If you are auditing a potential derivative, pick the method that matches your resources: query tests are token‑intensive but direct, the n‑gram observational test needs lots of text, and the reshuffle‑retrain approach trades compute for much smaller text requirements.
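As a rough illustration of the first recommendation, and not something the paper prescribes, a training pipeline can commit to its shuffle seed and the resulting example order at run time so a later provenance claim has evidence behind it. Everything below, including the file format, is a hypothetical sketch.

```python
# Illustrative sketch (not from the paper): record the shuffle seed and a hash
# of the ordered training transcript so the order used in training can later
# be produced and verified as part of a provenance claim.
import hashlib
import json
import random

def log_training_order(example_ids, seed, out_path="training_transcript.json"):
    rng = random.Random(seed)
    order = list(example_ids)
    rng.shuffle(order)                       # the order actually used for training
    digest = hashlib.sha256("\n".join(map(str, order)).encode()).hexdigest()
    with open(out_path, "w") as f:
        json.dump({"seed": seed, "order": order, "sha256": digest}, f)
    return order, digest
```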
This is not a silver bullet. The techniques target open‑weight and research platforms and can be costly or evaded. Still, the paper moves provenance checks out of the realm of guesswork and into something auditable. That, for organisations wrestling with licensing, privacy and compliance, is the useful part.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Blackbox Model Provenance via Palimpsestic Membership Inference
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies whether a developer can prove that a blackbox derivative of an open-weight language model is derived from a specific training run. It casts the question as an independence testing problem: are Bob's model or its outputs statistically independent of Alice's randomised training transcript? The work explores palimpsestic memorisation, where later training data are more likely to be memorised, and asks whether correlations between Bob's model or its text and the order of Alice's training examples can provide evidence against independence. If Alice shuffles her training data randomly, any detected correlation constitutes quantifiable statistical evidence against the null. The problem is examined in two settings: a query setting where Alice can prompt Bob's model to obtain token probabilities, and an observational setting where only Bob's generated text is available. The motivation includes model provenance, attribution, data leakage, and provenance verification, with implications for security and governance of AI systems.
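One way to write down the null hypothesis being tested, in notation of our own rather than the paper's, where pi is Alice's random shuffle of her training data and M_B stands for Bob's model or its outputs:

```latex
% Notation is ours, not the paper's. T is a correlation statistic between
% Bob's behaviour and Alice's training order; calibrating it against the
% permutation null is what gives the tests provable false-positive control.
\[
H_0 :\; M_B \;\perp\!\!\!\perp\; \pi ,
\qquad
p \;=\; \Pr_{\pi' \sim \mathrm{Unif}(S_n)}\!\big[\, T(M_B, \pi') \ge T(M_B, \pi) \,\big]
\]
```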
Approach
The authors formalise the problem and design tests with provable control over false positives. In the query setting they obtain log-likelihoods from Bob's model for each training example and its position in Alice's transcript and measure correlation with the training order. They roll out the approach across more than forty fine-tuned models from the Pythia and OLMo families, spanning 1B to 12B parameters, and use reference models to control for natural variation in text likelihood. In the observational setting they test two strategies: first, estimating the likelihood that Bob's text overlaps with spans of Alice's training data via n-gram-based proxies; second, comparing Bob's text against multiple versions of Alice's model obtained by retraining the last phase on reshuffled data. The procedures are designed to work with either token-level probabilities or direct text, and are implemented with exact or approximate p-values as appropriate. The evaluation uses five model families and a range of pretraining data scales, including TinyStories as a small-scale ablation platform, with code and data released for reproducibility.
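The n-gram proxy lends itself to a rough sketch. Assuming tokenised training examples with their positions in Alice's order and a sample of Bob's text, the idea is to ask whether the training examples whose n-grams resurface in Bob's output sit later in training than chance would predict. The names below and the use of a Mann-Whitney test are ours; the paper's actual proxy and calibration are more careful.

```python
# Hedged sketch of an n-gram overlap proxy for the observational setting.
from collections import defaultdict
from scipy.stats import mannwhitneyu

def ngrams(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap_test(alice_examples, positions, bob_tokens, n=8):
    # Map each n-gram in Alice's data to the training positions it appeared at.
    index = defaultdict(set)
    for toks, pos in zip(alice_examples, positions):
        for g in ngrams(toks, n):
            index[g].add(pos)
    # Positions of training examples whose n-grams resurface in Bob's text.
    matched = sorted({p for g in ngrams(bob_tokens, n) for p in index.get(g, ())})
    if not matched:
        return None  # not enough text to say anything
    # Are the matched positions shifted towards the end of training?
    _, p_value = mannwhitneyu(matched, positions, alternative="greater")
    return p_value
```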
Key Findings
- In the query setting, the authors observe correlation between the likelihood that Bob's model assigns to Alice's training examples and the order in which those examples appeared during Alice's training. Across more than forty Pythia and OLMo derivative models, correlations yield very small p-values in most cases, especially when using a reference model to account for natural variation in text.
- Using the observational setting, two approaches are compared. The first, based on n-gram overlap between Bob's text and the training data, requires large amounts of text (hundreds of thousands of tokens) to achieve high power. The second, which retrains copies of Alice's model on reshuffled data and compares their likelihoods for Bob's text, can reliably distinguish Bob's text with only a few hundred tokens (see the sketch after this list). In practice, the second method often provides stronger evidence with substantially fewer tokens and can tolerate some fine-tuning on Bob's side.
- Experiments on TinyStories and multiple training epochs demonstrate a palimpsestic memorisation pattern whereby later training data imprint signals into the model that persist and can be detected via the tests, including when the final model is derived from later phases of training or when only the first epoch ordering is available.
- The work highlights practical considerations: the query-based tests are costly in token usage and computation, while the observational tests can be expensive if many models need retraining; nonetheless, the framework offers transparent, non-invasive provenance checks with provable false-positive control.
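For intuition, the reshuffle-and-retrain comparison noted above can be sketched as follows, assuming Alice has already retrained K variants of her model whose final training phase used freshly reshuffled data. `score_text` is a hypothetical stand-in for a real log-likelihood computation, and the paper's exact procedure and p-value construction may differ.

```python
# Hedged sketch of the reshuffle-and-retrain comparison. If Bob never saw
# Alice's run, her original model has no reason to score his text higher than
# the reshuffled variants, so its rank among them is uniform under the null.
import numpy as np

def reshuffle_retrain_test(original_model, reshuffled_models, bob_text, score_text):
    """Return a permutation-style p-value from the rank of the original model.

    score_text(model, text) -> total log-likelihood of `text` under `model`.
    """
    s_orig = score_text(original_model, bob_text)
    s_variants = np.array([score_text(m, bob_text) for m in reshuffled_models])
    # Count how many exchangeable variants score Bob's text at least as highly.
    n_ge = int(np.sum(s_variants >= s_orig))
    p_value = (n_ge + 1) / (len(reshuffled_models) + 1)
    return p_value
```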
Limitations
Limitations include substantial computational and data requirements, especially for the observational n-gram approach, which needs large token counts for high power. The shuffled-data observational method relies on retraining multiple models, which may be costly. Some tests yield approximate p-values rather than exact ones, particularly when using shuffle-based statistics. Results vary with model scale, training regimen, and data provenance, and the authors note the potential for evasion by adversaries who alter training order or data handling. The scope focuses on open-weight models and specific experimental platforms, so generalisation to all models or industrial deployments requires caution.
Why It Matters
The study offers a principled path to provenance verification and attribution for AI systems by linking model behaviour and outputs to a known training transcript through statistically sound tests. It underscores both the risk of memorisation leaking training data and the potential to audit or certify provenance and licensing compliance for derivative models. Security implications include the possibility of detecting illicit reuse and flagging copyright or policy violations, while acknowledging that adversaries might attempt to evade tests by altering training order or data handling. Practical mitigations discussed include privacy preserving training, differential privacy, cryptographic provenance attestations, and watermarking to constrain or prove training data usage. Societal and regulatory implications centre on accountability, data rights, and licensing enforcement for AI, with a focus on data leakage risks and the feasibility of scalable provenance checks in large ecosystems.