Simple Prompt Injections Hijack LLM Scientific Reviews

Attacks

Published: Mon, Sep 15, 2025 • By Lydia Stratus

Simple Prompt Injections Hijack LLM Scientific Reviews

New research shows trivial prompt injections can steer LLM-generated peer reviews toward acceptance, sometimes reaching 100% acceptance rates. The study finds many models are biased toward saying accept even without manipulation, and simple hidden prompts reliably change scores. This exposes a real threat to automated review workflows and decision integrity.

A new paper demonstrates something ops teams have to treat like a live wire: tiny, hidden prompt injections reliably push LLM-generated scientific reviews toward acceptance. In plain terms, a careless paste or embedded instruction can flip a review from reject to accept, and many models already skew toward acceptance even with neutral input.

Diagram-in-words: [Reviewer laptop] -> [Paste paper into LLM or endpoint] -> [LLM returns structured review] -> [Copy to review form]. The weak links are the LLM endpoint, the model behavior on GPUs, vector stores and preprocessing, and where secrets and audit logs live.

Top infra risks

Endpoint compliance - models that follow hidden prompts instead of sandbox rules.
Vector poisoning - injected context in embeddings that persist across queries.
Secrets and creds - API keys and service tokens used by careless review scripts.
Data paths - unvalidated pipelines that strip provenance when converting PDFs to text.
GPU tenancy - shared or poorly isolated inference hosts that mix workloads and logs.

Quick checklist for first responders

Quarantine the endpoint serving review automation.
Snapshot current logs and model inputs for audit.
Rotate any exposed API keys used by the workflow.
Switch to read-only copies of vector DBs while investigating.

Stepwise mitigations - run book

Implement input containment: strip or neutralize hidden prompt tokens before forwarding text to the model.
Validate outputs against a strict schema and reject out-of-range ratings automatically.
Run adversarial tests: replay known injection patterns and log model responses.
Enforce multi-model or human-in-loop approval for acceptance decisions and require provenance tags on every review artifact.
Harden infra: isolate GPU hosts, lock down vector DB write paths, audit secret access, and monitor for sudden acceptance-rate shifts.

Why this matters: peer review decisions shape careers and funding. If your automation can be nudged by a stray sentence, attackers or sloppy integrations can skew outcomes at scale. Treat LLM review pipelines like any other critical service: add containment, validation, and provenance, and assume someone will try to game it.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications

Authors: Janis Keuper

The ongoing intense discussion on rising LLM usage in the scientific peer-review process has recently been mingled by reports of authors using hidden prompt injections to manipulate review scores. Since the existence of such "attacks" - although seen by some commentators as "self-defense" - would have a great impact on the further debate, this paper investigates the practicability and technical success of the described manipulations. Our systematic evaluation uses 1k reviews of 2024 ICLR papers generated by a wide range of LLMs shows two distinct results: I) very simple prompt injections are indeed highly effective, reaching up to 100% acceptance scores. II) LLM reviews are generally biased toward acceptance (>95% in many models). Both results have great impact on the ongoing discussions on LLM usage in peer-review.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies prompt injection attacks on large language model generated reviews of scientific publications to assess their practicality and impact on the peer review process. It uses a systematic evaluation based on 1000 reviews of 2024 ICLR papers produced by a wide range of LLMs to determine whether simple hidden prompts can steer review scores toward acceptance and to characterise any inherent bias in LLM based reviews.

Approach

The study mirrors a likely real world scenario where a careless reviewer pastes a paper into an LLM and copies a structured output into the review form. Papers from ICLR 2024 were converted from PDFs to Markdown via Mistral OCR and fed to multiple LLMs through structured prediction prompts to extract JSON styled reviews. A fixed system prompt and a paper specific prompt containing a hidden injection were used, with three injection strings: neutral, positively biased, and negatively biased. A diverse set of models were evaluated, including commercial options from OpenAI and Google and open or locally hosted models, all via their APIs or local deployments. Outputs were constrained to a predefined review schema and a rating scale consisting of the values 1, 3, 5, 6, 8 and 10; the researchers assessed adherence to the schema and to the rating range and tracked the occurrence of invalid scores such as 4.

Key Findings

Very simple prompt injections are indeed highly effective, with positive injections driving acceptance scores up to 100 percent for the evaluated papers.
Across many models LLM based reviews display a strong bias toward acceptance, with acceptance dominated across neutral prompts as well, exceeding 95 percent in many cases.
Robustness to manipulation varies by model: some systems show little to no response to prompt injections, while others follow the instructions closely; the likelihood of manipulation correlates with the model's ability to comply with the required structured output and to predict valid rating values, whereas some models also produce invalid scores such as four.

Limitations

The authors note that the study focuses on a scenario of a careless reviewer using an LLM and may not generalise to carefully designed or fine tuned review models. Potential training data leakage could influence results, as many models may have seen ICLR papers or reviews during training, though the observed shifts between human and LLM reviews suggest the effect is substantial. The pre processing pipeline and reliance on a fixed output schema may also affect findings, and results are drawn from ICLR 2024 submissions.

Why It Matters

The work highlights critical vulnerabilities in AI assisted peer review workflows and emphasises the need for defences including prompt defences defences prompt containment input and output auditing adversarial testing multi model or human in the loop verification and robust monitoring to prevent biased or manipulated outcomes. It also raises societal concerns about manipulation of peer review potentially skewing scientific progress funding decisions and policy directions and eroding research integrity and public trust.

Attribution Original paper on arXiv