New Framework Exposes Model Stealing Weaknesses
Here is the rude truth: your so-called private model is often more copyable than you think. New research offers the first comprehensive threat model and an evaluation framework for model stealing attacks on image classifiers, and the results are inconvenient for defenders.
The paper maps 2424 attacker scenarios to compare studies that previously talked past each other. That matters because inconsistency in reporting and assumptions has let vendors claim security without apples-to-apples tests. The researchers find most experiments use fewer than 50,000 queries to reproduce behavior, while data-free approaches can climb into the millions. Pretrained substitutes give attackers a big leg up. Partial stealing - copying only some classes or behaviors - can be cheaper and still dangerous.
Real-world impact is blunt: an attacker can clone a pricey ML-as-a-service model, bypass licensing, or spin up lookalike systems for fraud, misinformation, or surveillance. The most newsworthy takeaway is not a new exploit trick; it is that we finally have a standard way to measure how vulnerable deployed models really are. That lets defenders stop guessing and start proving whether mitigations work.
Two practical moves you can make today: 1) Audit and log query patterns per API key, set low-cost anomaly alerts and enforce strict rate limits. 2) Reduce output fidelity by returning top-k labels or coarsened probabilities, and test your defenses in a structured way against the paper's threat model. Do those and stop pretending obscurity equals security.
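As a rough sketch of what those two moves can look like in practice, here is a hypothetical API wrapper that logs and rate-limits queries per key and coarsens outputs to top-k rounded probabilities. The class name, the hourly limit, the top-k value and the `model.predict_proba` interface are all illustrative assumptions, not a specific product's API.

```python
import time
from collections import defaultdict, deque

class HardenedClassifierAPI:
    """Hypothetical wrapper: per-key rate limiting plus coarsened outputs.

    Assumes `model.predict_proba(x)` returns a dict {label: probability}.
    The window, query budget and top-k value are illustrative defaults.
    """

    def __init__(self, model, max_queries_per_hour=1000, top_k=3):
        self.model = model
        self.max_queries = max_queries_per_hour
        self.top_k = top_k
        self.query_log = defaultdict(deque)  # api_key -> recent query timestamps

    def _allow(self, api_key):
        now = time.time()
        log = self.query_log[api_key]
        # Drop timestamps older than one hour, then check the remaining budget.
        while log and now - log[0] > 3600:
            log.popleft()
        if len(log) >= self.max_queries:
            return False
        log.append(now)
        return True

    def predict(self, api_key, x):
        if not self._allow(api_key):
            raise RuntimeError("Rate limit exceeded; flag this key for review.")
        probs = self.model.predict_proba(x)
        # Return only the top-k labels with coarsely rounded probabilities,
        # reducing the signal available to a model-stealing attacker.
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:self.top_k]
        return {label: round(p, 1) for label, p in top}
```

The per-key log doubles as the audit trail: unusually high or unusually uniform query volumes per key are the anomalies worth alerting on.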
Additional analysis of the original ArXiv paper
Original Paper Title and Abstract
I Stolenly Swear That I Am Up to (No) Good: Design and Evaluation of Model Stealing Attacks
ShortSpan Analysis of the Paper
Problem
Model stealing attacks threaten the confidentiality of machine learning models offered as a service. A malicious party can query a black-box model to obtain predictions and train a substitute model, risking intellectual property and potentially enabling licensing circumvention. The field previously lacked standardised threat models and evaluation frameworks, making cross-study comparison difficult. This paper focuses on image classification attacks that train substitute models and seeks to provide a comprehensive threat model and a framework for attack comparison. It analyses how attack setups have been conducted in related works, derives best practices for attack development before, during and beyond experiments, and outlines an extensive list of open research questions regarding evaluation. The authors argue that the resulting methodology generalises to other problem domains, establishing the first generic evaluation approach for model stealing attacks.
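To make the threat concrete, the sketch below shows the basic substitute-training loop such attacks rely on: query the black-box target for labels, then fit a local model on those labels. The `query_target` stand-in, the ResNet-18 substitute and all hyperparameters are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def steal(query_target, attacker_images, num_classes=10, epochs=5, lr=1e-3):
    """Train a substitute on labels obtained by querying a black-box target.

    Assumptions: `query_target(batch)` stands in for the victim API and
    returns predicted class indices; `attacker_images` is an indexable
    collection of unlabelled image tensors suitable for the substitute.
    """
    substitute = models.resnet18(num_classes=num_classes)
    opt = torch.optim.Adam(substitute.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(attacker_images, batch_size=64, shuffle=True)

    for _ in range(epochs):
        for batch in loader:
            with torch.no_grad():
                labels = query_target(batch)   # every call spends query budget
            opt.zero_grad()
            loss = F.cross_entropy(substitute(batch), labels)
            loss.backward()
            opt.step()
    return substitute
```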
Approach
The authors present a comprehensive threat model for model stealing attacks on image classifiers and develop a framework for comparing attacks within that threat model. They survey a large body of prior work, analyse attack configurations in terms of learning tasks, data, models and queries, and derive best practices for designing, conducting and evaluating attacks. A key contribution is a graphical representation of attacker knowledge that segments 2424 possible scenarios, mapping prior papers to segments to assess comparability. The study reports on data types for attacker datasets (original problem domain, non-problem domain and data-free attacks); outputs, including labels, probabilities and explanations or gradients; target and substitute model architectures, including whether they share an architecture or differ; and query budgets ranging across multiple scales. They also discuss how the evaluation metrics fidelity, accuracy and transferability relate to attack goals and how these metrics should be reported. Finally, the authors propose best practices for experimentation and outline open research questions, asserting that the framework can transfer to other domains beyond image classification.
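As a rough illustration of how fidelity and accuracy differ, a minimal evaluation helper might look like the following. The prediction functions and data arguments are assumed interfaces rather than the paper's evaluation code, and transferability of adversarial examples is omitted for brevity.

```python
import numpy as np

def evaluate_substitute(target_predict, substitute_predict, test_x, test_y):
    """Compare a substitute against the target on a held-out test set.

    Assumes `target_predict` and `substitute_predict` map an array of inputs
    to an array of predicted class labels.
    """
    t = np.asarray(target_predict(test_x))      # labels the target would give
    s = np.asarray(substitute_predict(test_x))  # labels the substitute gives
    y = np.asarray(test_y)                      # ground-truth labels
    return {
        "accuracy": float((s == y).mean()),  # substitute vs. ground truth
        "fidelity": float((s == t).mean()),  # substitute vs. target behaviour
    }
```

A substitute can score high on fidelity while scoring low on accuracy (it faithfully copies the target's mistakes), which is why the two metrics map to different attack goals.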
Key Findings
- The paper introduces the first comprehensive threat model for model stealing attacks on image classifiers and a framework for comparing attacks according to attacker knowledge, capabilities and goals, enabling more consistent evaluations.
- Prior work exhibits wide heterogeneity in threat models, data availability, outputs and architectures; many attacks are not directly comparable because they operate under different assumptions or report incomplete metrics.
- A sizeable portion of the literature focuses on standard image datasets and common architectures; CIFAR-10, MNIST, Fashion-MNIST, SVHN, GTSRB and CIFAR-100 are frequently used for targets, with target architectures such as ResNet34, AlexNet, LeNet and ResNet18 being common; substitute architectures such as ResNet18 and various CNNs are also frequently used.
- Most experiments use fewer than 50,000 queries, while data-free attacks tend to require millions of queries; the ratio of queries per target data sample generally falls between 0.01 and 1 for many studies, whereas data-free attacks can reach well above 100 queries per sample (see the worked example after this list).
- Pretrained models often confer substantial performance advantages on attackers; however, reporting on whether targets or substitutes are pretrained is inconsistent, complicating comparisons.
- Partial model stealing, where only specific behaviours or subsets of classes are replicated, can be more efficient and may present a distinct threat compared with full functionality stealing.
- The authors present a set of best practices for attack development and evaluation (before, during and beyond experiments) and enumerate open research questions focused on generalisability across datasets, attacker data definitions, transfer learning and the applicability of findings to other domains.
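For intuition on the queries-per-sample ratio, here is a hypothetical back-of-the-envelope calculation. The CIFAR-10 training-set size is real; the two query counts are illustrative, not figures reported by the paper.

```python
# Hypothetical illustration of the queries-per-target-sample ratio.
target_training_samples = 50_000          # e.g. the CIFAR-10 training set

typical_attack_queries = 50_000           # within the "< 50,000 queries" regime
data_free_attack_queries = 10_000_000     # data-free attacks can need millions

print(typical_attack_queries / target_training_samples)    # 1.0 query per sample
print(data_free_attack_queries / target_training_samples)  # 200.0 queries per sample
```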
Limitations
The study concentrates on image classification attacks and a finite set of prior works, which may limit generalisability to other domains. Some papers lack explicit threat models or detailed reporting of experimental setups, data availability, outputs or architectures, which complicates cross-study comparisons. The analysis relies on published results and reported configurations, which may reflect publication biases. The threat model and evaluation framework are designed to be transferable but require adaptation for non-image domains and evolving attack techniques.
Why It Matters
The paper provides a formal threat model and a unified evaluation framework for model stealing attacks, enabling practitioners to assess how easily models can be cloned and to compare defensive measures. It highlights exploitation risks associated with black-box APIs, including IP theft, licence circumvention and degradation of the value of ML services. Practical mitigations suggested include standardised evaluation of defences, monitoring of query patterns, restriction of outputs, and the use of detectors or watermarking strategies. By establishing a generic evaluation methodology and outlining best practices and open questions, the work supports more robust testing of security controls and governance of machine learning services, and its findings are relevant beyond image classification to other problem domains.
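As one example of a watermarking-style mitigation, a trigger-set ownership check might look like the sketch below. The prediction interface and the secret trigger set are assumptions, and this is a generic technique for illustration rather than a method proposed in the paper.

```python
import numpy as np

def watermark_match_rate(suspect_predict, trigger_inputs, trigger_labels):
    """Hypothetical ownership test against a secret trigger set.

    Assumes `suspect_predict` maps inputs to predicted labels, and that the
    trigger set was embedded in the protected model during its training.
    """
    preds = np.asarray(suspect_predict(trigger_inputs))
    return float((preds == np.asarray(trigger_labels)).mean())

# A match rate far above chance on the secret triggers suggests the suspect
# model was derived from the watermarked original.
```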