
Bad fine-tuning data breaks small language models

Pentesting
Published: Tue, Nov 11, 2025 • By Dr. Marcus Halden
Researchers test 23 small language models and find that modest contamination of instruction data can wreck behaviour. Simple syntactic edits, such as reversing characters, often collapse performance; semantic corruptions can steer models toward harmful outputs once exposure passes a threshold. Larger models can be more easily hijacked, creating supply-chain risks for deployment.

This paper examines how instruction tuning with contaminated data changes the behaviour of small language models (SLMs). The authors study 23 models ranging from about 270 million to 4 billion parameters and apply four classes of corruption to the outputs used during tuning: two syntactic changes (character reversal and word reversal) and two semantic changes (irrelevant or counterfactual responses). Contaminated data is mixed in at levels of 25, 50, 75 and 100 per cent. Evaluation combines embedding similarity (all-mpnet-base-v2), lexical metrics and an LLM evaluator (Gemini 2.0 Flash), with human agreement checks.

What the researchers did

The experiment uses a cleaned instruction dataset merged from AlpaGasus and Dolly, a 2,018-item test set generated with GPT-4o, and five epochs of fine-tuning with AdamW and a cosine learning-rate schedule. Each contamination type is introduced only in the tuning outputs, not in the inputs, and the models include base and instruction-tuned variants across multiple families. The aim is not to blame any single model but to map the sensitivity patterns and thresholds that matter for real-world pipelines.
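To make the setup concrete, the sketch below shows how a contamination mixture and the described optimiser settings might be wired together. It is illustrative only: the dataset entries, model name and learning rate are placeholders, not the authors' artefacts.

```python
# Minimal sketch of the tuning setup described above: mix clean and corrupted
# responses at a fixed contamination level, then fine-tune with AdamW and a
# cosine learning-rate schedule. Data and model name are placeholders.
import random
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

def mix_contamination(clean, corrupted, level, seed=0):
    """Replace a `level` fraction of clean examples with their corrupted
    counterparts; assumes corrupted[i] corresponds to clean[i]."""
    rng = random.Random(seed)
    n_bad = int(len(clean) * level)
    bad_idx = set(rng.sample(range(len(clean)), n_bad))
    return [corrupted[i] if i in bad_idx else ex for i, ex in enumerate(clean)]

# Hypothetical instruction/response pairs; the real data would come from the
# merged AlpaGasus + Dolly set referenced above.
clean_data = [{"instruction": "Name a primary colour.", "response": "Red."}]
corrupted_data = [{"instruction": "Name a primary colour.", "response": ".deR"}]

tuning_data = mix_contamination(clean_data, corrupted_data, level=0.25)

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")  # placeholder SLM
epochs, steps_per_epoch = 5, len(tuning_data)
optimizer = AdamW(model.parameters(), lr=2e-5)  # assumed learning rate
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=epochs * steps_per_epoch)
# ... a standard causal-LM training loop over `tuning_data` would follow ...
```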

What they found and why it matters

Two findings stand out. First, simple syntactic corruption is devastating. Character reversal causes near-total failure across model families and sizes; word reversal also degrades performance substantially. In plain terms, if you accidentally or maliciously feed a model responses with scrambled characters during tuning, you can break its ability to answer normal prompts.
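For illustration, the two syntactic corruptions are trivial to reproduce; the following sketch (not the authors' code) shows what each does to a tuning response.

```python
# Character reversal flips the entire response string; word reversal flips
# the word order while leaving each word intact.
def reverse_characters(response: str) -> str:
    return response[::-1]

def reverse_words(response: str) -> str:
    return " ".join(response.split()[::-1])

original = "Paris is the capital of France."
print(reverse_characters(original))  # ".ecnarF fo latipac eht si siraP"
print(reverse_words(original))       # "France. of capital the is Paris"
```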

Second, semantic corruption behaves differently and more subtly. Irrelevant responses slowly erode contextual relevance and can collapse behaviour at high exposure. Counterfactual responses, however, are learned and can remain fluent and coherent, effectively teaching the model to follow false premises. The authors describe a 'capability curse': larger SLMs can be more prone to learning these semantic corruptions and following harmful instructions, and alignment or instruction-tuning does not uniformly protect against this vulnerability.

The pattern is not uniform across families. Some models (for example, Qwen2.5 and Phi4 in the paper) show greater resilience on some measures, while SmolLM2 often ranks among the least robust. Importantly, scaling up parameters does not offer a reliable fix for syntactic fragility.

There are boundaries to the study. Transformations were applied only to outputs during tuning, parameter-efficient tuning methods were not explored, and only four corruption types were tested. Transfer to other pipelines or to input-side contamination may differ.

For security teams the paper reframes instruction datasets as an attack surface. Supply-chain tampering, a corrupted dataset that slips past review, or accidental mixing of contaminated sources can either silently disable a model or steer it toward unsafe behaviour while leaving fluency intact.

  • Treat instruction data as hostile: verify provenance and scan for simple syntactic anomalies before tuning (a rough scanning heuristic is sketched after this list).
  • Stress-test tuning pipelines with low-rate injections of reversed or counterfactual outputs to find behavioural thresholds.
  • After tuning, run automated embedding-based checks plus human spot checks and fail-safe guardrails rather than trusting alignment alone.
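As a starting point for the scanning step in the first bullet, the rough heuristic below flags responses that look character-reversed by checking whether reversing them makes the tokens match a reference vocabulary much better. The tiny word set is a stand-in; a real pipeline would use a proper dictionary or language-identification model.

```python
# Heuristic pre-tuning scan (not from the paper): if reversing a response
# makes its tokens look much more like common words, the record is probably
# character-reversed. COMMON_WORDS is a toy stand-in for a real wordlist.
COMMON_WORDS = {
    "the", "is", "a", "of", "and", "to", "in", "that", "it", "for",
    "capital", "paris", "france", "answer", "question", "model",
}

def word_hit_rate(text: str) -> float:
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t.isalpha()]
    if not tokens:
        return 0.0
    return sum(t in COMMON_WORDS for t in tokens) / len(tokens)

def looks_char_reversed(text: str, margin: float = 0.2) -> bool:
    """Flag when the character-reversed text matches the vocabulary
    noticeably better than the text as given."""
    return word_hit_rate(text[::-1]) > word_hit_rate(text) + margin

print(looks_char_reversed("Paris is the capital of France."))  # False
print(looks_char_reversed(".ecnarF fo latipac eht si siraP"))   # True
```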

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Sensitivity of Small Language Models to Fine-tuning Data Contamination

Authors: Nicy Scaria, Silvester John Joseph Kennedy, and Deepak Subramani
Small Language Models (SLMs) are increasingly being deployed in resource-constrained environments, yet their behavioral robustness to data contamination during instruction tuning remains poorly understood. We systematically investigate the contamination sensitivity of 23 SLMs (270M to 4B parameters) across multiple model families by measuring susceptibility to syntactic and semantic transformation types during instruction tuning: syntactic transformations (character and word reversal) and semantic transformations (irrelevant and counterfactual responses), each applied at contamination levels of 25\%, 50\%, 75\%, and 100\%. Our results reveal fundamental asymmetries in vulnerability patterns: syntactic transformations cause catastrophic performance degradation, with character reversal producing near-complete failure across all models regardless of size or family, while semantic transformations demonstrate distinct threshold behaviors and greater resilience in core linguistic capabilities. Critically, we discover a ``\textit{capability curse}" where larger, more capable models become more susceptible to learning semantic corruptions, effectively following harmful instructions more readily, while our analysis of base versus instruction-tuned variants reveals that alignment provides inconsistent robustness benefits, sometimes even reducing resilience. Our work establishes three core contributions: (1) empirical evidence of SLMs' disproportionate vulnerability to syntactic pattern contamination, (2) identification of asymmetric sensitivity patterns between syntactic and semantic transformations, and (3) systematic evaluation protocols for contamination robustness assessment. These findings have immediate deployment implications, suggesting that current robustness assumptions may not hold for smaller models and highlighting the need for contamination-aware training protocols.

🔍 ShortSpan Analysis of the Paper

Problem

Small Language Models (SLMs) are increasingly deployed in on-device and resource-constrained environments, where data quality and provenance are critical yet imperfect. This study systematically investigates how instruction tuning with contaminated data affects 23 SLMs (270M to 4B parameters) across six families. Contamination is introduced via four transformation types: two syntactic (character reversal and word reversal) and two semantic (irrelevant responses and counterfactual responses), each at 25, 50, 75 and 100 per cent exposure. The aim is to characterise vulnerability patterns, identify thresholds at which behaviour shifts, and illuminate security implications for data poisoning and alignment in real-world pipelines.

Approach

The authors evaluate 23 models across six small language model families, including Gemma3, Llama3.2, OLMo2, Phi4, Qwen2.5 and SmolLM2, with both base and instruction-tuned variants (Phi4 is available only in an aligned form). A clean instruction-tuning dataset is formed by merging filtered AlpaGasus 9k and Dolly 3k data and cleaning it to a final set of 11,265 entries. The primary test set contains 2,018 question-answer pairs generated with GPT-4o, covering diverse topics. Four contamination datasets are produced by applying the two syntactic and two semantic transformations, and training uses mixtures of clean and contaminated data at four contamination levels (25, 50, 75 and 100 per cent) across all transformation types. Evaluation combines semantic similarity measures using all-mpnet-base-v2 embeddings with standard lexical metrics, while an LLM-based evaluator (Gemini 2.0 Flash) assesses pattern adherence, accuracy and grammar; human agreement checks validate the automated scoring. Training runs for five epochs with AdamW and a cosine learning-rate scheduler on suitable GPUs, enabling a comprehensive, multi-model contamination robustness assessment.
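The snippet below sketches only the embedding-similarity component of that evaluation, using the same all-mpnet-base-v2 encoder; the lexical metrics and the Gemini-based judging are omitted, and the flagging threshold is an assumption for illustration rather than a value from the paper.

```python
# Embedding-similarity check between model outputs and reference answers.
# A low cosine similarity suggests the tuned model has drifted off-task.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

references = ["Paris is the capital of France."]
model_outputs = [".ecnarF fo latipac eht si siraP"]  # e.g. answer from a corrupted model

ref_emb = encoder.encode(references, convert_to_tensor=True, normalize_embeddings=True)
out_emb = encoder.encode(model_outputs, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(out_emb, ref_emb).diagonal().tolist()
for out, score in zip(model_outputs, scores):
    flag = "LOW" if score < 0.5 else "ok"  # assumed threshold, for illustration only
    print(f"{flag}  sim={score:.2f}  output={out!r}")
```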

Key Findings

  • Syntactic contamination drives dramatic performance collapse across models and sizes, with character reversal causing near-zero accuracy in many cases and word reversal producing substantial but less severe degradation.
  • Semantic contamination exhibits asymmetric patterns: counterfactual responses are learned and remain semantically coherent at high levels of contamination, whereas irrelevant responses gradually erode contextual relevance and eventually collapse at full contamination; grammatical correctness remains high under semantic disruption.
  • A capability curse emerges: larger, more capable models become more susceptible to learning semantic corruptions, effectively following harmful instructions more readily; alignment benefits are inconsistent and can reduce resilience in some cases.
  • Model family and size influence syntactic robustness in non-uniform ways; no family proves fully robust, with SmolLM2 often among the least resilient and Qwen2.5 and Phi4 showing stronger performance on some measures; scaling does not uniformly improve robustness to syntactic disruption.
  • The evaluation framework demonstrates strong alignment between human judgments and the automated Gemini evaluator, supporting the reliability of the contamination assessment.

Limitations

Limitations include applying transformations only to outputs during instruction tuning rather than to inputs, not exploring parameter-efficient methods such as LoRA, and restricting the study to four transformation types. Details of data cleaning and of the generation and scoring prompts are provided in the paper's appendix; transfer to other training regimes or data pipelines may vary.

Why It Matters

The results reveal an abrupt vulnerability gap in small models: even modest syntactic contamination can cause massive degradation, while semantic contamination can steer models toward harmful outputs at higher exposure. The finding that larger models can be more easily hijacked by corrupted data raises security concerns for instruction-tuning pipelines and data provenance. Immediate implications include the need for contamination-aware training and evaluation, strict data provenance and filtering, red-teaming of fine-tuning data, and robust post-tuning verification with guardrails. The work supports the view that contamination-aware design principles are essential for safe, reliable on-device AI as small models proliferate in security-critical or automated decision contexts.

