Classifiers Spot Prompt Injection in LLM Apps

Defenses

Published: Tue, Dec 16, 2025 • By Theo Solander

Classifiers Spot Prompt Injection in LLM Apps

Researchers curate and augment a prompt injection dataset and train multiple classifiers, including LSTM, feedforward, Random Forest and Naive Bayes, to detect malicious prompts before they reach a Large Language Model (LLM). Results show strong detection on a balanced corpus, and propose front-end filtering, logging and access limits to reduce risk.

Prompt injection is the newest form of an old problem: untrusted input changing program behaviour. This study takes a pragmatic approach. It builds an augmented dataset from HackAPrompt playground submissions, mixes in SQuAD2 material, and trains several classifiers to label prompts as malicious or benign before they touch a Large Language Model (LLM). The team uses TFâ€“IDF features, balances the data to roughly twenty thousand samples per class, and evaluates Long Short-Term Memory (LSTM), a three layer feedforward network, Random Forest and Naive Bayes models. On that dataset the detectors perform very well.

What the results mean

The headline is straightforward: conventional machine learning and neural models can separate many crafted instruction attacks from ordinary queries in a curated setting. The LSTM and Random Forest models show especially strong precision and recall on the balanced corpus. That does not mean the problem is solved. Past security cycles teach a cautionary lesson. Early web filters stopped simple SQL injection but failed once attackers adapted payloads or shifted context. Similarly, a high score on a test set that was filtered, deduplicated and augmented with keyword replacement will overstate real-world robustness.

The authors are transparent about limits. The scenario is single purpose rather than general conversational agents, and one model was trained on more data due to practical constraints. Models trained on balanced data face a deployment reality where malicious prompts are rare, adversaries can probe systems, and natural user language is highly variable. Overconfidence in a front-line detector can create a brittle defence that invites subtle bypasses.

What teams should do now

Use a detector as one layer, not a gatekeeper. Place a filter in front of the LLM to flag likely instruction attacks and log every prompt for later analysis. Apply concrete controls: run the LLM in read-only or sandbox modes for untrusted inputs, enforce rate limits, require authentication for elevated actions, and design escalation paths where a human reviews ambiguous cases. Where the paper suggests banning users whose prompts are predominantly instructions, treat bans as a high-friction option reserved for clear abuse; prefer progressive responses such as warnings, timeouts and temporary throttles.

Operationally, invest in adversarial testing and dataset diversity. Expand the training corpus beyond curated playground submissions, include real user traffic and red-team variants, and subject detectors to adaptive attack simulations. Consider hybrid architectures that combine cheap, fast classifiers for triage with slower, stronger analyses for ambiguous cases. Finally, monitor drift and retrain frequently; attackers will iterate and models must too.

The pattern is familiar: early gains from supervised detection, followed by attacker adaptation, then a healthier equilibrium of layered controls and ongoing measurement. Treat classifier detectors as a useful and pragmatic risk reduction, not a final cure. That honest posture will keep systems safer and engineering budgets under control.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Detecting Prompt Injection Attacks Against Application Using Classifiers

Authors: Safwan Shaheer, G. M. Refatul Islam, Mohammad Rafid Hamid, Md. Abrar Faiaz Khan, Md. Omar Faruk, and Yaseen Nur

Prompt injection attacks can compromise the security and stability of critical systems, from infrastructure to large web applications. This work curates and augments a prompt injection dataset based on the HackAPrompt Playground Submissions corpus and trains several classifiers, including LSTM, feed forward neural networks, Random Forest, and Naive Bayes, to detect malicious prompts in LLM integrated web applications. The proposed approach improves prompt injection detection and mitigation, helping protect targeted applications and systems.

🔍 ShortSpan Analysis of the Paper

Problem

Prompt injection attacks threaten the security and stability of LLM enabled applications. The study addresses detection and mitigation by building datasets and training classifiers to recognise malicious prompts before they influence system behaviour.

Approach

The work compiles and augments a prompt injection dataset based on the HackAPrompt Playground Submissions corpus, replacing keywords with random strings to reduce bias. It merges with a version of the SQuAD2 dataset and labels prompts as malicious or benign. Text is transformed with TF IDF features (max features 1000) and fed to several classifiers: LSTM, a feedforward neural network, Random Forest, and Naive Bayes. An 80/20 train test split is used after data preparation; the data is balanced to 50/50. The neural networks use random initial embeddings and are trained with Adam with a learning rate of 0.001 in batches of 96 for 25 epochs. The architectures include a three layer dense feedforward network with an embedding input layer and a sigmoid output, and an LSTM followed by a dense layer and a sigmoid output. Evaluation covers precision, recall, F1 and confusion matrices for both classes. A mitigation strategy places the detector in front of the LLM, logs the user prompts, and bans a user if the majority of prompts are instructions, while warning or timing out others. Additional safeguards include read only and sandbox access, rate limiting, and universal authentication to reduce attack surface.

Key Findings

The LSTM classifier achieved near perfect performance with high precision, recall and F1 scores for benign and malicious classes on a balanced data set of around twenty thousand samples per class, producing nineteen thousand nine hundred forty one true positives and twenty thousand fifteen true negatives, with a small number of false negatives and false positives.
The Random Forest classifier performed similarly well with high precision and recall and very low false negatives and false positives, though it was trained on more data than the other models in this study due to hardware and time considerations.
The Naive Bayes model also showed high accuracy with a balanced distribution of errors, while the feedforward neural network offered strong precision and recall but a lower overall accuracy compared with the other models, indicating a different error profile.
Across models the results support the feasibility of a front end detector to mitigate prompt injection, with a mitigation strategy that logs prompts, considers author bans for repeated instructions, and uses warnings or timeouts for others, alongside input controls such as read only access and rate limiting to reduce risk.
Quality and augmentation steps include balancing the dataset, filtering out overly long or short prompts, deduplication and manual quality checks, with future work focused on expanding datasets, exploring hybrid model architectures and adversarial evaluation to improve robustness.

Limitations

Limitations include the specificity of the prompt injection scenario to a single purpose application rather than a conversational chatbot, and the observation that Random Forest used more training data than some other models due to hardware and time constraints. Generalisability to broader LLM applications and real time, multi domain prompts remains to be validated.

Why It Matters

The study demonstrates that both conventional machine learning and neural network detectors can effectively identify prompt injection attempts, enabling practical mitigations such as pre filtering of inputs, logging and user level responses, and restricted access to limit harm. The security implications include reducing output manipulation and data leakage in AI powered services, supporting safer deployment of LLM enabled applications. The suggested mitigations emphasise read only access, rate limiting, universal authentication, and real time detection to guard against prompt based attacks in critical systems.

Attribution Original paper on arXiv