Fine-tuned LLMs Miss Simple PowerShell Evasions
New research shows small fine-tunes turn Large Language Models (LLMs) into brittle PowerShell classifiers. Alias swaps, command reconstruction and even case changes bypass the fine-tuned model while the base model holds up. The failure lives in an inherited late-attention circuit that fine-tuning narrows into token-level rules invisible to standard tests.
Security teams love a quick fine-tune. Take a capable LLM, add a few thousand labelled PowerShell scripts, and call it a malware classifier. On paper it scores well on held-out data. In production, someone swaps Invoke-WebRequest for its alias iwr, or rebuilds a command through Invoke-Expression, and your detector waves it through. This study shows why.
Researchers compared Llama-3.1-8B-Instruct with a security-tuned sibling, Foundation-Sec-8B-Instruct, on matched PowerShell cohorts. They built a three-tier evasion set: direct rewrites, reconstructive obfuscations and case mutations. The base model did not miss on any of these variants. The fine-tuned model did. It was correct on 38 of 44 direct rewrites and 42 of 46 reconstructive variants, with misses that included iwr substitution and Invoke-Expression reconstruction, and it missed 9 of 63 case-only edits where the base stayed solid. Translation: small, behaviour-preserving edits are enough to slide past the fine-tuned gatekeeper.
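To put the tier counts on one scale, the short tally below converts them to miss rates. It is only arithmetic on the figures quoted above; the tier labels and the pooled figure are added here for illustration and are not reported by the paper.

```python
# (fine-tuned misses, base misses, total variants) per tier, from the counts above.
TIERS = {
    "direct rewrites": (44 - 38, 0, 44),
    "reconstructive obfuscations": (46 - 42, 0, 46),
    "case mutations": (9, 0, 63),
}

def miss_rate(misses: int, total: int) -> float:
    """Fraction of variants in a tier that the model labelled benign."""
    return misses / total

for name, (ft_miss, base_miss, total) in TIERS.items():
    print(f"{name}: fine-tuned {ft_miss}/{total} = {miss_rate(ft_miss, total):.1%}, "
          f"base {base_miss}/{total}")

# Pooled figure across tiers (an illustrative aggregate, not a number from the paper).
pooled_misses = sum(t[0] for t in TIERS.values())
pooled_total = sum(t[2] for t in TIERS.values())
print(f"pooled: {pooled_misses}/{pooled_total} = "
      f"{miss_rate(pooled_misses, pooled_total):.1%}")
```

That works out to roughly 13.6%, 8.7% and 14.3% per tier for the fine-tuned model, against zero for the base on every tier.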
Mechanically, the classifier lives in a late-attention bundle around Layers 12–13. It is not a new detector born of fine-tuning, but an inherited route the fine-tune sharpens and repurposes. Path patching that bundle cuts the malicious-versus-benign margin by about 3.61–3.82, removing 77–81 percent of the average margin and flipping 87 of 293 examples. The tell: on transformed inputs the attention still lands on the right places, but the MLP computation just before the Layer 13 boundary suppresses or reverses that evidence. You have not deleted the detector; you have taught it brittle indicator semantics.
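A quick back-of-envelope check shows how those patching numbers fit together. The sketch assumes the lower reduction pairs with the lower percentage and the upper with the upper, which the text does not state, so treat the implied margins as rough.

```python
flipped, cohort = 87, 293
flip_rate = flipped / cohort

# Assumption: endpoints pair up in order (3.61 with 77%, 3.82 with 81%).
reduction = (3.61, 3.82)
share = (0.77, 0.81)

# If a reduction r removes a fraction s of the mean margin, the mean margin is r / s.
implied_margin = tuple(r / s for r, s in zip(reduction, share))
residual_margin = tuple(m - r for m, r in zip(implied_margin, reduction))

print(f"flip rate: {flip_rate:.1%}")
print("implied mean margin before patching:", [round(m, 2) for m in implied_margin])
print("implied mean margin after patching:", [round(m, 2) for m in residual_margin])
```

Under that pairing the mean margin before patching comes out near 4.7 and stays around 1 afterwards. A mean that remains positive is consistent with fewer than a third of examples flipping even though most of the margin is gone.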
From an attacker’s seat, this is cheap. Swap canonical tokens for aliases, reconstruct strings and feed them to Invoke-Expression, or toggle case on IEX. Behaviour stays the same; the fine-tuned model’s token rules break. If you have wired such a model into triage or auto-blocking, expect misses that the untuned base would not make. It is the rare day the simpler model is less fragile than the specialised one, but here we are.
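For defenders who want to reproduce these three edits in a test set, a minimal Python sketch of a variant generator follows. The function names, the chunk size and the harmless example.com seed are illustrative choices, not the paper's benchmark code, and the reconstruction is only intended to preserve behaviour for simple one-line commands; it has not been validated against a PowerShell interpreter here.

```python
import random
import re

# Built-in PowerShell aliases for the two cmdlets discussed above.
ALIASES = {"Invoke-WebRequest": "iwr", "Invoke-Expression": "iex"}

def alias_swap(script: str) -> str:
    """Replace canonical cmdlet names with their aliases (case-insensitive match)."""
    for canonical, alias in ALIASES.items():
        script = re.sub(re.escape(canonical), alias, script, flags=re.IGNORECASE)
    return script

def case_mutate(script: str, token: str, seed: int = 0) -> str:
    """Randomly flip the case of each character in every occurrence of token."""
    rng = random.Random(seed)

    def flip(match):
        return "".join(
            ch.upper() if rng.random() < 0.5 else ch.lower() for ch in match.group(0)
        )

    return re.sub(re.escape(token), flip, script, flags=re.IGNORECASE)

def reconstruct(command: str, size: int = 5) -> str:
    """Rebuild the command from chunks with the -f operator, run via Invoke-Expression."""
    chunks = [command[i:i + size] for i in range(0, len(command), size)]
    fmt = "".join("{" + str(i) + "}" for i in range(len(chunks)))
    # Single quotes inside a single-quoted PowerShell string are escaped by doubling.
    args = ",".join("'" + c.replace("'", "''") + "'" for c in chunks)
    return f'Invoke-Expression ("{fmt}" -f {args})'

seed_script = "Invoke-WebRequest https://example.com"
variants = {
    "alias": alias_swap(seed_script),
    "case": case_mutate(seed_script, "Invoke-WebRequest"),
    "reconstructed": reconstruct(seed_script),
}
for tier, text in variants.items():
    print(f"{tier}: {text}")
```

Each variant should be checked for equivalent behaviour in a sandbox before it is counted as a valid evasion test case.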
There is at least a pre-deployment tripwire. A linear probe at the model’s internal boundary (resid_pre13) predicts Layer-12 head contributions with r=0.80–0.87 within a model, and transfers across models well enough to flag risky command families. An indicator-token sign test, which measures how confidence changes when a canonical indicator token is ablated, also exposes flips in token roles: for Invoke-WebRequest, the mean logit-diff goes to +1.13 in the fine-tuned model versus −1.60 in the base. That is a strong hint to generate alias, reconstruction and case variants before you ship.
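The sign test reduces to comparing the sign of a per-family mean between the two models. A minimal sketch follows, assuming each input value is the change in a script's malicious-minus-benign logit difference when the canonical token is ablated; the paper's exact definition may differ, and `sign_test` and `min_abs` are hypothetical names.

```python
from statistics import mean

def sign_test(base_deltas, tuned_deltas, min_abs=0.0):
    """Return (base_mean, tuned_mean, flipped) for one indicator family.

    flipped is True when the two means have opposite signs and both exceed
    min_abs in magnitude, i.e. the token plays opposite roles in the two models.
    """
    b, t = mean(base_deltas), mean(tuned_deltas)
    flipped = (b * t < 0) and min(abs(b), abs(t)) > min_abs
    return b, t, flipped

# Using the reported family-level means for Invoke-WebRequest as single values.
print(sign_test([-1.60], [1.13]))

# Toy per-script values (invented for illustration) where the sign agrees.
print(sign_test([0.5, 0.7], [0.2, 0.9]))
```

A non-zero `min_abs` threshold is one way to avoid flagging families whose means sit close to zero in either model.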
The caveat: this work targets 293 matched pairs across seven indicator families and filters to examples both models initially got right, so do not treat the miss rates as global. But the shape of the failure maps cleanly to real attack surfaces. If your detection hinges on a narrow set of tokens, someone will change them at 3am and you will not get a page.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Inherited Circuits, Learned Semantics: How Fine-Tuning Creates Evasion Vulnerabilities Invisible to Standard Evaluation
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies how small task-specific fine-tuning of large language models for security classification can create hidden evasion vulnerabilities that standard held-out evaluation misses. In a PowerShell malware-detection setting, behaviour-preserving transformations such as alias substitution, command reconstruction, string construction, execution indirection and case mutation can flip malicious inputs to benign labels even though canonical accuracy remains high. The work shows these failures arise from how fine-tuning reweights an inherited internal classification route rather than creating a wholly new detector, so conventional test splits do not expose the expanded attack surface.
Approach
The authors analyse a base/fine-tuned pair, Llama-3.1-8B-Instruct and Foundation-Sec-8B-Instruct, on a matched PowerShell cohort. They use mechanistic interventions at an internal classification boundary (resid_pre13): path patching to test sufficiency and head ablation to test necessity of small sets of attention heads and MLPs. They build a reusable, three-tier evasion benchmark of seed/variant pairs covering direct rewrites (direct_v1), reconstructive obfuscations (reconstructive_v2) and case mutations (case_mutation_v3). They also derive a cheap pre-deployment monitor: a linear probe from resid_pre13 to predict layer-12 head contributions and an indicator-token sign test that measures how confidence changes when canonical indicator tokens are ablated.
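The two interventions can be illustrated on a toy additive model. The sketch below is not the paper's setup: it treats the boundary residual as a plain sum of head outputs, uses activation-style patching as a simplification of path patching, and every size, index and scale is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_model = 8, 16

# Hypothetical readout direction: a positive projection means "malicious".
readout = rng.normal(size=d_model)
readout /= np.linalg.norm(readout)

def margin(head_outputs):
    """Margin of a toy residual stream formed by summing all head outputs."""
    return float(head_outputs.sum(axis=0) @ readout)

# Toy per-head outputs for one malicious and one benign script.
malicious = rng.normal(scale=0.1, size=(n_heads, d_model))
benign = rng.normal(scale=0.1, size=(n_heads, d_model))
bundle = [2, 5]                      # hypothetical "late-attention bundle"
malicious[bundle] += 2.0 * readout   # these heads carry the malicious evidence

def ablate(head_outputs, heads):
    """Zero-ablate the chosen heads."""
    out = head_outputs.copy()
    out[heads] = 0.0
    return out

def patch(target, source, heads):
    """Overwrite the chosen heads in target with their outputs from source."""
    out = target.copy()
    out[heads] = source[heads]
    return out

ablation_drop = margin(malicious) - margin(ablate(malicious, bundle))
patch_drop = margin(malicious) - margin(patch(malicious, benign, bundle))
print(f"margin drop from ablating the bundle: {ablation_drop:.2f}")
print(f"margin drop from patching in benign outputs: {patch_drop:.2f}")
```

In a real transformer the same idea is applied with forward hooks on the actual attention heads, and the downstream computation is not simply additive, so measured effects need the controls a proper path-patching setup provides.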
Key Findings
- Classification circuit localisation: a concentrated late-attention bundle around Layers 12–13 causally drives PowerShell classification. Path patching of this bundle reduces the mean malicious-versus-benign decision margin by about 3.61–3.82, removing roughly 77–81% of the average margin and flipping 87/293 examples (about 29.7%).
- Inherited but specialised route: the same basic circuit exists in Llama and transfers to Foundation-Sec; fine-tuning concentrates causal weight into fewer Layer-12 heads and adds indicator-token semantics rather than creating a new detector.
- Transformation-sensitive misses specific to the fine-tune: Foundation-Sec shows consistent misses that Llama does not on the evaluated variants. Notable failures include Invoke-WebRequest alias substitution (iwr), Invoke-Expression format-string reconstruction, and case-mutated Invoke-Expression/IEX variants. Across benchmark tiers Foundation-Sec missed 6/44 direct_v1 variants (38 correct), 4/46 reconstructive_v2 variants (42 correct) and 9/63 case_mutation_v3 variants; Llama missed 0/44, 0/46 and 0/63 on the same sets.
- Mechanism of evasion: the late-attention evidence often remains present on transformed inputs, but MLP computation before the Layer 13 boundary can reverse or suppress it, so the final label flips to benign even though the detector itself has not been deleted.
- Pre-deployment signals: a linear ridge probe from resid_pre13 predicts Layer-12 head contributions with r=0.80–0.87 within-model and transfers across models well enough to give a useful family-level signal; an indicator-token sign test detects families where ablation flips the role of canonical tokens (Invoke-WebRequest shows +1.13 mean logit-diff in Foundation-Sec versus −1.60 in Llama).
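The ridge probe in the last finding is a regularised linear regression from boundary activations to a head's contribution, scored by Pearson correlation on held-out data. A minimal sketch on synthetic data follows; the dimensions, noise level and regularisation strength are invented, and real inputs would be cached resid_pre13 activations and measured head contributions.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression without an intercept: (X^T X + lam I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def pearson_r(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
n, d = 400, 32
X = rng.normal(size=(n, d))                      # stand-in for boundary activations
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=2.0, size=n)   # stand-in for one head's contribution

# Fit on the first 300 examples, score on the remaining 100.
train, test = slice(0, 300), slice(300, n)
w = fit_ridge(X[train], y[train])
r = pearson_r(X[test] @ w, y[test])
print(f"held-out Pearson r = {r:.2f}")
```

For the cross-model check described above, the probe would be fitted on one model's activations and scored on the other's, which this synthetic example does not attempt.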
Limitations
The study uses a 293-pair matched cohort drawn from seven indicator families and filters to examples that both models initially classify correctly, so the results are focused findings rather than population-level prevalence estimates. The benchmark covers specific transformation families and accepted variants; other obfuscation techniques and additional fine-tuned models remain to be tested. Per-script prioritisation is noisy; the methods are intended for family-level red-teaming rather than per-script prediction.
Implications
An attacker can cheaply evade a fine-tuned security classifier by applying behaviour-preserving indicator transformations that disrupt fine-tuned indicator semantics even though the underlying malicious evidence persists internally. Fine-tuning can therefore convert inherited detection structure into brittle, token-sensitive rules that pass canonical tests but expand the evasion surface. Pre-deployment teams should compare base and fine-tuned activations on canonical inputs, use the sign test to flag families that flipped token roles, and prioritise red-team generation of alias, reconstruction, string-construction, indirection and case-mutation variants for those families.
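One way to operationalise the base-versus-fine-tune comparison is a differential harness that counts, per family and tier, the malicious variants the base model catches but the fine-tune misses. The sketch below uses toy string-matching stand-ins for both classifiers; `Variant`, `differential_misses` and the stand-ins are hypothetical, and in practice each callable would wrap a real model call.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple

class Variant(NamedTuple):
    family: str   # indicator family, e.g. "Invoke-WebRequest"
    tier: str     # e.g. "canonical", "alias", "case"
    script: str   # every variant is malicious by construction

def differential_misses(
    variants: Iterable[Variant],
    base_is_malicious: Callable[[str], bool],
    tuned_is_malicious: Callable[[str], bool],
) -> dict:
    """Count variants the base model flags but the fine-tuned model lets through."""
    report = defaultdict(lambda: {"total": 0, "tuned_only_misses": 0})
    for v in variants:
        row = report[(v.family, v.tier)]
        row["total"] += 1
        if base_is_malicious(v.script) and not tuned_is_malicious(v.script):
            row["tuned_only_misses"] += 1
    return dict(report)

# Toy stand-ins: the "base" matches loosely, the "fine-tune" only the exact canonical token.
def toy_base(script: str) -> bool:
    return any(t in script.lower() for t in ("invoke-webrequest", "iwr"))

def toy_tuned(script: str) -> bool:
    return "Invoke-WebRequest" in script

suite = [
    Variant("Invoke-WebRequest", "canonical", "Invoke-WebRequest https://example.com"),
    Variant("Invoke-WebRequest", "alias", "iwr https://example.com"),
    Variant("Invoke-WebRequest", "case", "iNvOkE-wEbReQuEsT https://example.com"),
]
report = differential_misses(suite, toy_base, toy_tuned)
for key, row in report.items():
    print(key, row)
```

Families flagged by the sign test are the natural place to spend variant-generation effort first, with this kind of report showing which tiers actually break.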