Study Probes JATMO Defences Against Prompt Injection
Defenses
Large Language Model (LLM) deployments keep running into the same awkward problem: models follow instructions too well, and attackers weaponise that behaviour with prompt injection. A new paper tests a narrow fine-tuning defence called JATMO against an evolving attack framework, HOUYI, and finds partial protection but no silver bullet.
What the researchers did
The team fine-tuned non-instruction-tuned base models with JATMO to perform a single task: summarising Amazon All Beauty reviews. They used LoRA to adapt LLaMA 2-7B, Qwen1.5-4B and Qwen1.5-0.5B, training each on 1,500 input-output pairs created with a GPT-3.5 teacher. As a comparison point they used an instruction-tuned GPT-3.5-Turbo baseline.
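The paper does not ship its training code, but a minimal sketch of how a comparable LoRA adaptation could be set up with the Hugging Face transformers and peft libraries is below. The model checkpoint, hyperparameters and the jatmo_pairs.jsonl file are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal LoRA fine-tuning sketch (assumed setup, not the authors' exact code).
# Requires: transformers, peft, datasets, torch. The dataset file is hypothetical.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

base = "Qwen/Qwen1.5-0.5B"                      # smallest of the three base models
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapters on the attention projections; the base weights stay frozen.
lora = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
                  lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)

def tokenize(example):
    # One pair: three concatenated reviews followed by the teacher summary.
    text = f"{example['reviews']}\n\nSummary: {example['summary']}{tok.eos_token}"
    return tok(text, truncation=True, max_length=1024)

ds = load_dataset("json", data_files="jatmo_pairs.jsonl")["train"]   # ~1,500 pairs
ds = ds.map(tokenize, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("jatmo-summariser", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```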
To probe robustness they adapted HOUYI, a genetic attack framework that mutates prompts, adding custom fitness scoring, new mutation logic and a local harness that runs attacks directly against the fine-tuned models. They evaluated two attack goals, content manipulation and information gathering, using 72 unique evolved injections per attack type, and measured attack success rate alongside ROUGE-L to track task fidelity.
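The adapted fitness function is not reproduced in detail here; the snippet below is one plausible way to score an evolved injection for the two attack goals, rewarding candidates whose injected payload visibly surfaces in the output and, secondarily, candidates that degrade ROUGE-L. The weighting and the payload_marker convention are assumptions, not HOUYI's actual scoring.

```python
# Illustrative fitness scoring for evolved injections (assumed, not HOUYI's exact logic).
# Requires: rouge-score (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def fitness(output: str, reference_summary: str, payload_marker: str) -> float:
    """Higher is better for the attacker.

    output            -- model response to (task prompt + evolved injection)
    reference_summary -- clean summary used as the ROUGE-L reference
    payload_marker    -- string the injection tries to force into the output,
                         e.g. a planted phrase (content manipulation) or an
                         echo of the system prompt (information gathering)
    """
    hijacked = 1.0 if payload_marker.lower() in output.lower() else 0.0
    fidelity = scorer.score(reference_summary, output)["rougeL"].fmeasure
    # Reward successful hijacks; reward degrading the original task as a tiebreaker.
    return 0.8 * hijacked + 0.2 * (1.0 - fidelity)
```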
What they found and why it matters
The results are blunt. The instruction-tuned GPT-3.5-Turbo is fully compromised by both attack types. JATMO-tuned models fare better, lowering attack success rates roughly four- to tenfold relative to the baseline. That is a meaningful improvement, but not immunity. Mid-sized models, notably Qwen1.5-4B, remain susceptible to cleverly phrased prompts, multilingual cues and code-like triggers that derail the summarisation objective.
There is a clear, uncomfortable trade-off: models that score higher on ROUGE-L, meaning they generate better summaries, also tend to be easier to hijack. In plain terms, pushing for higher task quality can reintroduce the instruction-following tendencies that attackers exploit. The study also finds residual instruction-following in the non-instruction-tuned models: imperative-style prompts sometimes still hijack outputs despite fine-tuning.
The work has sensible limitations. It tests a single task and a small set of base models, uses LoRA rather than full-weight updates, and runs attacks in a local harness rather than in diverse production settings. Those caveats matter when you try to generalise the results to other tasks or heavier models.
For security teams the main takeaway is practical: fine tuning narrows the attack surface but does not remove it. Defence needs to be layered and adversarially informed. Upstream filtering, semantic validators, and runtime output constraints matter. Treat JATMO-style tuning as one tool, not a last line of defence.
Checks your team can run right now:
- Run an automated prompt-injection sweep with mutated multilingual and code-like prompts against your fine-tuned models (a minimal sweep sketch follows this list).
- Measure a task-fidelity metric such as ROUGE-L alongside attack success so you can spot the quality-versus-robustness trade-off.
- Deploy simple runtime guards: input sanitisation, prompt firewalls and output schema enforcement to limit the impact of successful injections.
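A minimal sketch of the first two checks is below, assuming a generate_summary callable that wraps your model and an injections list of mutated payloads (both placeholders); it reports attack success rate alongside mean ROUGE-L so the quality-versus-robustness trade-off is visible in a single line of output.

```python
# Minimal prompt-injection sweep (illustrative; model wrapper and corpus are placeholders).
# Requires: rouge-score (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def sweep(generate_summary, task_input, reference, injections, marker):
    """generate_summary(text) -> str is your model call.
    injections: mutated multilingual / code-like payloads appended to the task input.
    marker: string whose appearance in the output counts as a successful hijack."""
    hits, rouge_total = 0, 0.0
    for inj in injections:
        out = generate_summary(f"{task_input}\n\n{inj}")
        hits += int(marker.lower() in out.lower())
        rouge_total += scorer.score(reference, out)["rougeL"].fmeasure
    n = len(injections)
    print(f"attack success: {hits}/{n} ({hits / n:.0%}), mean ROUGE-L: {rouge_total / n:.3f}")
```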
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Securing Large Language Models (LLMs) from Prompt Injection Attacks
🔍 ShortSpan Analysis of the Paper
Problem
Large Language Models are increasingly deployed, but their instruction following makes them vulnerable to prompt injection attacks. JATMO trains non-instruction-tuned base models to perform a single narrow task in order to reduce susceptibility to adversarial prompts. This study tests JATMO against HOUYI, a genetic attack framework that mutates prompts to bypass guardrails. The authors adapt HOUYI with custom fitness scoring, revised mutation logic and a local testing harness. They fine-tune LLaMA 2-7B, Qwen1.5-4B and Qwen1.5-0.5B with JATMO and compare them with a GPT-3.5-Turbo baseline. The task is summarisation of Amazon All Beauty reviews, using 1,500 input-output pairs generated with a GPT-3.5 teacher. Training uses LoRA. Attacks target content manipulation and information gathering.
Approach
The study combines JATMO-style fine-tuning of non-instruction-tuned models with a modified HOUYI prompt injection framework to assess robustness under adversarial pressure. A summarisation dataset is created from the Amazon All Beauty reviews by grouping reviews into triples and generating concise summaries with a GPT-3.5 teacher, yielding 1,500 input-output pairs formatted for fine-tuning. LoRA is used to train the models efficiently, leaving the base weights frozen. The evaluation contrasts the three fine-tuned models with an instruction-tuned GPT-3.5-Turbo baseline. HOUYI is adapted with task-specific fitness scoring for content manipulation and information gathering, reworked disruptors, and a local harness that runs against the fine-tuned models rather than through an external API. Each model is evaluated with 72 unique injection prompts per attack type, drawn from the evolved prompt population. Two metrics are used: attack success rate and ROUGE-L to estimate task fidelity under attack.
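The data pipeline is described but not released; the sketch below shows how the triples-plus-teacher step could look, using the OpenAI Python client as the GPT-3.5 teacher. The prompt wording, file name and batching are assumptions rather than the authors' implementation.

```python
# Sketch of building JATMO-style fine-tuning pairs (assumed pipeline, not the authors' code).
# Requires: openai (pip install openai) and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

def build_pairs(reviews, n_pairs=1500, out_path="jatmo_pairs.jsonl"):
    """Group Amazon All Beauty reviews into triples and ask a GPT-3.5 teacher
    to write a concise summary for each triple."""
    with open(out_path, "w") as f:
        for i in range(0, min(n_pairs * 3, len(reviews) - 2), 3):
            triple = "\n\n".join(reviews[i:i + 3])
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user",
                           "content": f"Summarise these product reviews concisely:\n\n{triple}"}],
            )
            summary = resp.choices[0].message.content
            f.write(json.dumps({"reviews": triple, "summary": summary}) + "\n")
```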
Key Findings
- GPT-3.5-Turbo is fully compromised, with both attack types achieving a 100% success rate.
- JATMO-tuned models substantially reduce vulnerability, lowering average attack success rates roughly four- to tenfold compared with the instruction-tuned baseline.
- Nevertheless, JATMO does not fully prevent injections; mid-sized models such as Qwen1.5-4B remain susceptible to adversarial phrasing, multilingual mutations and code-related triggers.
- A clear trend shows that higher ROUGE-L scores correlate with higher injection success, indicating a trade-off between generation quality and robustness under adversarial pressure.
- Non-instruction-tuned models retain residual instruction-following tendencies, with prompts that resemble imperative cues sometimes hijacking the output despite fine-tuning.
- Multilingual and mixed-language prompts can bypass learned patterns, and prompts containing coding cues can trigger code-generation biases that derail the summarisation task.
Limitations
Limitations include evaluation on only three base models with LoRA fine-tuning, for a single summarisation task and dataset, which may limit generalisability to other tasks or architectures. The attack configuration, although substantial, remains finite (72 prompts per type), and results depend on the chosen teacher model and data distribution. The study relies on a modified local harness for evaluation, which may differ from real-world deployment environments. These factors suggest caution when extrapolating to broader settings.
Why It Matters
The findings show that fine-tuning can meaningfully reduce prompt injection risk relative to instruction-tuned baselines, but it does not provide a complete shield. The observed vulnerability associated with higher task quality, multilingual hints and code-related prompts highlights the need for layered, adversarially informed mitigations. Practical implications include implementing upstream input filtering and prompt firewalls, semantic validators to detect override cues, and formatting-based safeguards alongside output-constrained decoding and schema enforcement. The authors advocate large-scale adversarial training across multilingual and multi-intent prompts, along with risk-weighted evaluation that accounts for the severity of attacks. Together these strategies point to a holistic defence framework combining fine-tuning, adversarial training and runtime safeguards to reduce both the likelihood and potential impact of prompt injection attacks.
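As one concrete illustration of those runtime safeguards (not something specified in the paper), the sketch below pairs a crude prompt firewall, which flags obvious override cues in user input, with a summary schema check that rejects outputs containing code or links. The cue patterns and limits are placeholder assumptions and would need tuning for a real deployment.

```python
# Illustrative runtime guards: crude prompt firewall + output schema check (assumptions, not from the paper).
import re

OVERRIDE_CUES = re.compile(
    r"ignore (all|previous) instructions|system prompt|reveal your prompt", re.IGNORECASE)

def firewall(user_input: str) -> str:
    """Reject or flag inputs containing obvious override cues before they reach the model."""
    if OVERRIDE_CUES.search(user_input):
        raise ValueError("possible prompt injection detected")
    return user_input

def enforce_summary_schema(output: str, max_sentences: int = 4) -> str:
    """Accept only bounded plain prose; refuse code fragments, links or overlong output."""
    if "```" in output or "http" in output:
        raise ValueError("output violates summary schema")
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s]
    return " ".join(sentences[:max_sentences])
```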