Study Reveals RCE Risks in Model Hosting
A recent empirical study examines a practical but underappreciated threat: executing untrusted Python during model initialisation. Remote code execution (RCE) at model load time expands the attack surface of model-sharing ecosystems. The authors analyse five major platforms—Hugging Face, ModelScope, OpenCSG, OpenMMLab and PyTorch Hub—and combine static analysis, pattern matching and a review of platform controls to show how common and how risky the practice is.
What the researchers did and found
The study searches for custom load-time code across 128,487 Python files and traces reachability from model-loading entry points to identify 67,009 execution-relevant files. In total, about 45,000 repositories contain code that runs when a model is loaded. Prevalence varies: Hugging Face, OpenCSG and ModelScope show roughly 1 to 4 per cent of models with custom loaders, while OpenMMLab and PyTorch Hub rely almost entirely on custom code, creating substantial systemic exposure.
Static analysers Bandit, CodeQL and Semgrep reveal a pattern. Many issues are low-severity coding weaknesses such as unchecked asserts (Bandit rule B101) and cases mapped to Common Weakness Enumeration (CWE) entries like CWE 703. Critically, higher-severity problems show up often: insecure deserialization and injection risks (CWE 502, CWE 95, CWE 78) appear across platforms, with notable concentration on Hugging Face and PyTorch Hub. YARA signatures mostly point to environment markers rather than active payloads, suggesting that many loaders reveal their deployment targets but are not obviously malicious.
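To make these findings concrete, here is a minimal, hypothetical loader exhibiting the patterns the scanners flag; it is illustrative only and not drawn from any repository in the study.

```python
# Hypothetical custom loader showing the patterns Bandit/Semgrep flag.
# Illustrative only; not taken from the study's dataset.
import pickle
import subprocess


def load_pretrained(checkpoint_path, setup_cmd=None):
    """Load weights the unsafe way; this is what CWE 502 / CWE 78 findings look like."""
    if setup_cmd:
        # CWE 78 (OS command injection): an attacker-controlled string reaches a shell.
        subprocess.run(setup_cmd, shell=True, check=True)

    # CWE 502 (insecure deserialization): pickle.load can execute arbitrary code
    # embedded in the checkpoint via __reduce__ during unpickling.
    with open(checkpoint_path, "rb") as f:
        state_dict = pickle.load(f)

    assert "weights" in state_dict  # Bandit B101: asserts vanish under python -O
    return state_dict
```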
Platform safeguards are inconsistent. Hugging Face offers automated scanning and user warnings; other platforms provide minimal verification. Adoption of safer serialization formats such as SafeTensors sits at roughly six to seven per cent. None of the platforms studied implements comprehensive host-side sandboxing for loader-time execution. The qualitative review of developer discussions shows confusion about trust flags like trust_remote_code and trust_repo, and frequent trade-offs between convenience and safety.
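As a sketch of what safer defaults look like at the consumer end (assuming the Hugging Face transformers and PyTorch APIs; the model identifier and file names are placeholders, and the parameter choices are this article's illustration rather than the paper's):

```python
# Safer loading defaults (sketch). Model id and paths are placeholders.
import torch
from transformers import AutoModel

MODEL_ID = "org/some-model"  # placeholder

# With trust_remote_code=False (the default), repository-supplied Python is not
# executed; models that require custom code will raise instead of running it.
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=False,  # refuse code shipped with the repository
    use_safetensors=True,     # insist on the non-executable weight format
)

# For raw checkpoints, weights_only=True restricts torch.load to tensor data
# rather than arbitrary pickled objects.
state_dict = torch.load("checkpoint.pt", weights_only=True)
```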
Why this matters for defenders
The work ties technical findings to governance realities. Policy and platform design shape defaults. If a hub makes it easy to execute arbitrary code by default, many users will accept the convenience and downstream deployments will inherit the risk. Automated warnings alone are not enforcement. Cryptographic provenance checks, enforced integrity at upload, and runtime isolation are governance levers that change behaviour; they are blunt instruments and add friction, but they shift the incentives that currently favour convenience over safety.
Practical trade-offs are clear. Sandboxing and cryptographic checks add complexity and can break workflows. Static tools produce noisy results and require triage. Banning all custom loaders would improve safety but would also impede legitimate model functionality and slow innovation.
Short term, organisations can act this quarter by banning trust_remote_code in production pipelines, scanning model packages with multiple static tools, requiring SafeTensors or equivalent for artifacts, and capturing load-time telemetry for any third-party models. Over the medium term they should plan for stronger measures: run-time isolation for untrusted loaders, cryptographic provenance and integration with CI controls, and clearer supply-chain policies for model consumption. The study gives a pragmatic map: fix defaults, align platform incentives with security, and make the safe path the easy path rather than the optional one.
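A hedged sketch of the short-term controls above: a small pre-deployment check that fails the build when pipeline code passes trust_remote_code=True or when a model directory ships pickle-based weights without a SafeTensors file. The directory layout and policy are assumptions for illustration, not prescriptions from the study.

```python
"""Pre-deployment check (sketch): flag trust_remote_code=True in pipeline code
and model directories with pickle weights but no SafeTensors artifact.
Paths and policy are illustrative assumptions."""
import ast
import sys
from pathlib import Path


def uses_trust_remote_code(py_file: Path) -> bool:
    """Return True if any call in the file passes trust_remote_code=True."""
    tree = ast.parse(py_file.read_text(encoding="utf-8"), filename=str(py_file))
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (kw.arg == "trust_remote_code"
                        and isinstance(kw.value, ast.Constant)
                        and kw.value.value is True):
                    return True
    return False


def missing_safetensors(model_dir: Path) -> bool:
    """True if the directory has pickle-era weights but no .safetensors file."""
    has_pickle = any(model_dir.glob("*.bin")) or any(model_dir.glob("*.pt"))
    return has_pickle and not any(model_dir.glob("*.safetensors"))


if __name__ == "__main__":
    violations = [f"trust_remote_code=True in {f}"
                  for f in Path("pipelines").rglob("*.py") if uses_trust_remote_code(f)]
    violations += [f"no SafeTensors artifact in {d}"
                   for d in Path("models").iterdir() if d.is_dir() and missing_safetensors(d)]
    if violations:
        print("\n".join(violations))
        sys.exit(1)
```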
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
An Empirical Study on Remote Code Execution in Machine Learning Model Hosting Ecosystems
🔍 ShortSpan Analysis of the Paper
Problem
Model hosting ecosystems such as Hugging Face, ModelScope, OpenCSG, OpenMMLab and PyTorch Hub enable loading and fine-tuning pre-trained models but allow code execution during loading via trust_remote_code or trust_repo. This creates an implicit trust relationship among model authors, platforms and users and expands the attack surface to arbitrary Python execution. The public safety concern is the execution of untrusted code at model initialisation, which can deliver remote code execution, data exfiltration or other compromises. This study conducts the first large-scale, cross-platform empirical analysis of custom model loading practices to quantify prevalence, risk and developer perceptions and to inform safer design of model sharing infrastructures.
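To illustrate the trust relationship in concrete terms (a minimal, hypothetical example; the repository and entry point names are placeholders): torch.hub fetches a repository and imports its hubconf.py, so any module-level Python in that file runs on the consumer's machine at load time, and trust_repo is the consumer's acknowledgement of that.

```python
# hubconf.py inside a hypothetical model repository. torch.hub imports this
# module, so everything at module level executes on the consumer's machine.
import torch

dependencies = ["torch"]  # hub convention: packages required by the entry points

# Any statement placed here (network calls, file writes, subprocess) would run
# as soon as the module is imported during loading.

def tiny_model(pretrained: bool = False):
    """Entry point that torch.hub.load returns to the caller."""
    return torch.nn.Linear(4, 2)


# Consumer side (would only work if such a repository existed):
# model = torch.hub.load("some-org/some-repo", "tiny_model", trust_repo=True)
```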
Approach
The study covers five major platforms: Hugging Face, OpenCSG, ModelScope, OpenMMLab and PyTorch Hub. Repositories were included if they could be retrieved programmatically and could be loaded via model loading APIs with the possibility of executing custom code during loading. Repositories were filtered to identify those that require custom code, using tags such as custom_code or files named tokenizer.py, __init__.py or hubconf.py, or names starting with modeling_, tokenization_ or configuration_. The authors collected metadata for all repositories and analysed execution-relevant Python files with a reachability approach: a JARVIS-based inter-procedural, context-sensitive analysis identified execution paths reachable from model loading entry points, yielding 67,009 execution-relevant files from an initial 128,487 Python files. Three static analysers (Bandit, CodeQL and Semgrep) were run to detect security smells and vulnerabilities, and findings were mapped to CWE categories; YARA was used to identify malicious patterns. In parallel, platform-level documentation, API design and safeguards were reviewed, and a qualitative analysis of over 600 developer discussions from GitHub, Hugging Face, PyTorch Hub forums and Stack Overflow was conducted. The analysis covers the period up to October 2025 and includes approximately 45,000 repositories with custom load-time code, plus cross-platform comparisons of prevention mechanisms and governance.
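A minimal sketch of the filename-based filter described above (the directory layout is an assumption, and the JARVIS reachability analysis itself is not reproduced here):

```python
# Sketch of the study's filename-based filter for candidate custom-code files.
# The repository layout is assumed; reachability analysis is not reproduced.
from pathlib import Path

EXACT_NAMES = {"tokenizer.py", "__init__.py", "hubconf.py"}
PREFIXES = ("modeling_", "tokenization_", "configuration_")


def is_candidate(path: Path) -> bool:
    """True if the file matches the naming patterns used to flag custom loaders."""
    return path.suffix == ".py" and (
        path.name in EXACT_NAMES or path.name.startswith(PREFIXES)
    )


def candidate_files(repo_dir: str):
    """Yield Python files that may execute custom code during model loading."""
    for p in Path(repo_dir).rglob("*.py"):
        if is_candidate(p):
            yield p
```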
Key Findings
- Prevalence of custom loading code varies by platform with approximately 1 to 4 per cent of models on Hugging Face, OpenCSG and ModelScope containing custom code; OpenMMLab and PyTorch Hub rely almost entirely on custom code, indicating a high systemic exposure on these platforms.
- Across platforms, about 45,000 repositories contain code that executes at model load time, representing a substantial attack surface given the scale of model hubs and the rapid rate of new uploads.
- Security smells and vulnerabilities cluster in two groups: widespread low-severity coding weaknesses such as CWE 703 and Bandit rule B101 (assert statements), present in roughly 60 to 80 per cent of affected repositories, and high-severity injection and deserialization risks including CWE 502, CWE 95 and CWE 78, which are prevalent particularly on Hugging Face and PyTorch Hub.
- Static analysis with Bandit and Semgrep identifies injection and insecure deserialization as the dominant issues; Semgrep additionally highlights integrity violations as common OWASP-related patterns across platforms. CodeQL alerts are dominated by input handling and timing-related cryptographic concerns, including a large share of potential timing attacks against secrets.
- YARA-based detection finds environment and artifact markers rather than clearly malicious payloads, with signatures for virtualization environments and specific file formats being the most common signals.
- Platform-level safeguards are uneven. Hugging Face provides comprehensive automated scanning and user-facing warnings, but other platforms offer minimal automated verification and lack sandboxing; SafeTensors adoption remains relatively low at around six to seven per cent (a conversion sketch follows this list). No platform implements comprehensive host-side sandboxing for loader-time code execution.
- Developer discussions reveal widespread confusion and risk perception gaps around trust_remote_code and trust_repo, with a taxonomy of concerns including pipeline incompatibilities, deployment barriers, version conflicts, fear of malicious code, trust model ambiguity and calls for native integration of custom code into official libraries.
- Long term trends show sustained growth in custom loading code on Hugging Face with large scale surges linked to major model releases, while ModelScope shows steady growth and OpenCSG demonstrates volatile patterns tied to platform infrastructure changes.
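On the SafeTensors point flagged in the list above, a minimal conversion sketch (file names are placeholders; weights_only=True assumes a reasonably recent PyTorch):

```python
# Convert a pickle-based checkpoint to SafeTensors (sketch; paths are placeholders).
import torch
from safetensors.torch import save_file, load_file

# weights_only=True restricts unpickling to tensor data on recent PyTorch versions.
state_dict = torch.load("pytorch_model.bin", weights_only=True)

# SafeTensors stores raw tensors plus metadata, so nothing executes on load.
# save_file requires contiguous tensors, hence the .contiguous() call.
save_file({k: v.contiguous() for k, v in state_dict.items()}, "model.safetensors")
restored = load_file("model.safetensors")
```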
Limitations
Static analysis relies on reachability from model loading entry points and may under-approximate dynamic Python behaviour; results are conservative lower bounds. The study focuses on five large platforms and may not generalise to smaller or enterprise registries. Some analyses, such as ClamAV-style scanning, are platform-specific and do not capture every potential defence. The coding of developer concerns used open coding with moderate agreement and may reflect coding discipline and interpretive differences.
Why It Matters
The findings highlight the need for stronger default protections, consistent enforcement of safe loading practices and better developer guidance to reduce remote code execution risks in model hosting ecosystems. Actionable implications include moving beyond passive warnings to enforced security boundaries such as default sandboxing of untrusted code, provenance and integrity checks at upload, and clearer load time risk telemetry for developers. Safer alternatives to custom loading such as secure serialization formats like SafeTensors are discussed, while adoption barriers and ecosystem inertia are recognised. The study recommends integrating cryptographic integrity verification with runtime isolation and developing automated enforcement frameworks that more explicitly define trust boundaries across platforms.
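As one concrete instance of integrity verification at consumption time, here is a sketch that pins SHA-256 digests for model artifacts and refuses anything that does not match; the manifest format is an assumption for illustration, not a mechanism specified by the paper.

```python
# Integrity check before loading (sketch): compare artifact digests against a
# pinned manifest. The manifest format is an illustrative assumption.
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_dir(model_dir: str, manifest_path: str) -> None:
    """Raise if any artifact is missing or its digest differs from the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())  # {"relative/name": "sha256 hex"}
    for rel_name, expected in manifest.items():
        artifact = Path(model_dir) / rel_name
        if not artifact.exists() or sha256_of(artifact) != expected:
            raise RuntimeError(f"integrity check failed for {rel_name}")
```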