Researchers Clone LLMs From Partial Logits Under Query Limits
Attacks
A new paper demonstrates a disturbingly practical route from leaked top-k logits to a working, deployable LLM clone. In short: collect responses to fewer than 10,000 queries, use singular value decomposition to recover the output projection, then distill a small student model. The whole pipeline completes in under 24 GPU hours and reproduces the teacher's behavior with minimal loss.
This is not theoretical. The researchers show 6-layer and 4-layer student models that retain most of the teacher's geometry and generalize to unseen data. That means an adversary with modest resources can turn partial logit exposure into IP theft, unauthorized replicas, or a way to bypass safety checks in systems used for satellite control, military decision support, or cyber defenses.
The attack exploits a simple blind spot: many teams treat logits as harmless metadata when they are not. The pipeline paces its queries to stay under typical rate-limit triggers, which makes detection harder. The authors note limitations: the method assumes top-k logits are available and was evaluated on specific model families. Those caveats are little comfort if your inference API leaks information.
Security takeaway: lock down inference outputs and assume anything you return can be weaponized. Below are concrete checks your team can run now.
Actionable checks teams can run:
- Audit API responses to verify no full or top-k logits are returned to callers.
- Scan logs for patterned queries and repeated probing that suggest projection reconstruction.
- Enforce per-client rate limits and trigger alerts on structured extraction behavior.
- Reduce output precision or add calibrated noise and remove fine-grained confidence scores (a minimal sketch follows this list).
- Move critical models to on-prem or trusted-edge deployments and evaluate watermarking or differential privacy defenses.
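To make the output-hardening item concrete, here is a minimal sketch of a response sanitizer that could sit between the model and the API serializer. It is illustrative only: the function name sanitize_logits and the parameters top_k, decimals and noise_scale are our own, not the paper's, and the noise calibration would need tuning against your utility and leakage requirements.

```python
import numpy as np

def sanitize_logits(logits, top_k=5, decimals=2, noise_scale=0.05,
                    return_scores=False, rng=None):
    """Post-process raw logits before they leave the inference service.

    Keeps only the top_k tokens, optionally drops scores entirely, and
    otherwise returns coarsely rounded, noised probabilities.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    top_idx = np.argsort(logits)[-top_k:][::-1]         # highest-scoring token ids
    if not return_scores:
        return {"token_ids": top_idx.tolist()}          # no confidence signal at all
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                # softmax over the full vocabulary
    noisy = probs[top_idx] + rng.normal(0.0, noise_scale, size=top_k)
    noisy = np.clip(noisy, 0.0, 1.0).round(decimals)    # coarse, noised scores
    return {"token_ids": top_idx.tolist(), "scores": noisy.tolist()}
```

Even with scores enabled, rounding plus noise erodes the rank structure that SVD-based reconstruction depends on; whether the distortion is sufficient for your threat model is something to validate empirically.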
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Clone What You Can't Steal: Black-Box LLM Replication via Logit Leakage and Distillation
🔍 ShortSpan Analysis of the Paper
Problem
Large language models are increasingly embedded in mission-critical operations such as satellite tasking, command and control, military decision support and cyber defence. Many systems are accessed via APIs, and when access controls are weak these interfaces can reveal full or top-k logits, exposing a significant attack surface. Prior work has largely focused on reconstructing final projections or mimicking surface-level behaviour, not on regenerating a black-box model under realistic query constraints. This paper studies a practical replication risk: how partial logit leakage can be transformed into a deployable surrogate model using constrained queries, logit-based analysis and distillation, highlighting the urgency of hardened inference APIs and robust on-prem defences in high-stakes environments.
Approach
The authors propose a two-stage black-box replication pipeline. Stage one reconstructs the output projection matrix W by collecting top-k logits from fewer than ten thousand black-box queries and applying singular value decomposition to the logit matrix. The insight is that the top d singular directions reveal the subspace in which the projection weights lie, allowing an estimate W_hat within the same column space as the true W. Stage two distills the remaining transformer blocks into compact student models of varying depths trained on open-source data. The recovered projection layer is frozen and the students are trained by distillation to emulate the teacher's behaviour, using a loss that blends a softened KL term with a cross-entropy term and a modest emphasis on matching outputs. The attack uses only public data for training the student and operates on the top-k logits exposed by the API, under realistic rate limits.
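To make the two stages concrete, here is a compact sketch under our own assumptions: the logit matrix L is assumed to be assembled from the collected responses (with unobserved entries imputed, e.g. zero-filled), the hidden size d, temperature and alpha are placeholder choices, and the function names recover_projection_subspace and distill_loss are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def recover_projection_subspace(L: torch.Tensor, d: int) -> torch.Tensor:
    """Stage one: estimate the output projection from an (n_queries x vocab) logit matrix.

    The top-d right singular vectors of L approximately span the column space
    of the true projection W, so stacking them gives a W_hat that lies in the
    same subspace, up to an invertible transform.
    """
    _, _, Vh = torch.linalg.svd(L, full_matrices=False)
    return Vh[:d].T                                   # (vocab x d) estimate

def distill_loss(student_logits, teacher_logits, labels,
                 temperature=2.0, alpha=0.7):
    """Stage two: blend a softened KL term with cross entropy on hard labels."""
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                              # standard temperature scaling
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```

Note that W_hat is only determined up to an invertible transform within the recovered subspace; freezing it and distilling the remaining blocks lets the student absorb that ambiguity during training.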
Key Findings
- The six-layer student (Student 6) achieves strong fidelity, reproducing 97.6 per cent of the teacher model's hidden-state geometry with a 7.31 per cent perplexity increase and a negative log likelihood of 7.58 (one common way to quantify this kind of representational overlap is sketched after these findings).
- A smaller four-layer variant (Student 4) delivers notable efficiency gains, with 17.1 per cent faster inference and an 18.1 per cent parameter reduction while maintaining comparable performance.
- The approach completes in under 24 GPU hours and avoids triggering rate-limit defences, illustrating how quickly a cost-constrained adversary can clone a production-grade LLM from partial logit leakage.
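The hidden-state geometry figure is a representational-similarity measure. The paper does not spell out the exact metric here, so the sketch below uses linear centered kernel alignment (CKA), a common choice for comparing teacher and student activations, purely as an illustration of how such a number can be computed.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between two (n_samples x dim) activation matrices.

    Returns a value in [0, 1]; values near 1 mean the two sets of hidden
    states share the same geometry up to rotation and isotropic scaling.
    """
    X = X - X.mean(dim=0, keepdim=True)        # centre each feature column
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(X.T @ Y) ** 2    # squared Frobenius norm of X^T Y
    return (cross / (torch.linalg.norm(X.T @ X) * torch.linalg.norm(Y.T @ Y))).item()
```

Teacher and student hidden states would be collected on the same held-out prompts before comparison.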
Limitations
The study assumes access to full or top-k logits from a black-box API and uses open-source training data for distillation, which may not capture the full defensive environment of protected enterprise deployments. Results are demonstrated on distilGPT-2-like baselines and may vary with larger or differently structured models. The attack relies on the ability to perform a substantial number of queries within a budget and on the availability of a recoverable projection subspace; stronger inference protections could mitigate the specific leakage exploited here. The work focuses on static cloning in a single API setting and does not exhaustively evaluate all possible defensive countermeasures.
Why It Matters
This work demonstrates a practical vulnerability in API-based LLM deployments that can enable high-fidelity replication of proprietary models using only top-k logits and public data. It highlights risks to the IP and security of critical systems, where cloned models could substitute for or undermine trusted engines, bypass rate limits or access controls, and potentially degrade alignment safeguards. The results underscore the need for stronger inference API protections, including avoiding exposure of logits, strict authentication and rate limiting, monitoring for extraction-style activity, output obfuscation, secure on-prem deployments, and potential watermarking or differential-privacy-based defences to deter model exfiltration. The societal and security implications are especially pertinent for defence-oriented AI applications, where cloned substitutes could be misused to manipulate automated decision making in satellite operations, cyber defence and related areas.