Prompt Injections Hijack AI Paper Reviews

Attacks

Published: Tue, Nov 04, 2025 • By Dr. Marcus Halden

Prompt Injections Hijack AI Paper Reviews

New research shows hidden prompts embedded in PDF submissions can push AI-assisted reviewers to give overly positive evaluations. Two attack types—static and iterative—raise scores on frontier reviewer models, especially Gemini and DeepSeek. A simple detection step cuts success but adaptive attackers can still bypass it, so layered safeguards are needed.

Research teams examined a practical and worrying threat for AI-assisted peer review: concealed prompts placed inside submitted papers that steer automated reviewers toward favourable scores. The study focuses on modern large language models (LLM) used as reviewer assistants and treats the problem in straightforward security terms: an attacker manipulates inputs to change an automated decision.

What the researchers did

The authors test two attack patterns. The static attack embeds a fixed injection prompt inside the paper PDF. The iterative attack refines an injection prompt against a simulated reviewer over several rounds to maximise its effect. They run experiments on 100 ICLR 2025 submissions using a controlled PDF text extraction workflow. Three frontier reviewer models are evaluated: GPT-5, DeepSeek-Chat, and Gemini-2.5-Pro. Reviews use a uniform ICLR-style scoring rubric on a 1 to 10 scale, with three dimensions scored and each paper reviewed three times to compute mean and standard deviation.

Attacks are measured by how much they raise those average scores and by whether reviews reach unusually high values. The researchers also probe robustness: changing where injections are inserted, varying paper length and human ratings, and attempting to transfer crafted prompts across models. Finally, they test a simple detection-based defence that asks the reviewer to flag suspicious prompt material, and they examine adaptive attackers that try to evade that defence.

Key findings and limits

The results are clear but not identical across models. Static injections raise average scores by 1.91 points for Gemini, 2.80 for DeepSeek, and 1.24 for GPT-5, showing model-dependent vulnerability. Iterative attacks are more powerful: within up to three optimisation rounds they push Gemini and DeepSeek reviews close to the maximum score, while GPT-5 shows smaller, more restrained increases. The crafted prompts transfer across models in some cases; prompts tuned on GPT-5 transfer particularly well to the other two models. The simple detection defence reduces attack success in many cases, but adaptive attacks can partially circumvent it and still produce substantial score inflation.

There are important caveats. The study uses 100 papers and three reviewer models because of cost constraints, so we should be cautious about generalising to all review pipelines or future models. Still, the direction is clear: input-level prompt injections can shift automated assessments, and defences that rely solely on simple detection are likely to be brittle in the face of adaptive adversaries.

Why this matters is also practical. Peer review scores influence acceptance, funding and reputational decisions. If bad actors can reliably bias AI-assisted reviews, the integrity of those decisions erodes. The problem is not a single software bug but a class of input manipulation attacks that require system-level thinking to mitigate.

Operational takeaways

Assume inputs can be hostile: apply sanitisation and strict extraction controls before feeding text to an LLM reviewer.
Don’t rely on one model or a single detection check: use ensembles, audited human-in-the-loop decisions and monitoring for anomalous scoring patterns.
Red team review pipelines with iterative prompt-injection tests and track cross-model transfer to find weak spots early.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Authors: Qin Zhou, Zhexin Zhang, Zhi Li, and Limin Sun

With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.

🔍 ShortSpan Analysis of the Paper

Problem

AI models are increasingly used to assist in scientific peer review, but hidden prompt injections embedded in submissions can manipulate AI reviewers to give overly positive evaluations. The paper investigates In-Paper Prompt Injection IPI attacks and defenses, identifying two attack classes static and iterative, showing they can raise scores for frontier AI reviewers and are robust across settings. A simple detection based defence reduces success but adaptive attackers can partially bypass, highlighting an arms race and need for safeguards including input controls, monitoring, alignment, and human oversight.

Approach

The authors define two attack paradigms: a static attack with fixed injected prompts embedded invisibly in PDFs, and an iterative attack that optimises prompts through several feedback rounds against a simulated reviewer model. They evaluate three frontier reviewer models GPT-5, DeepSeek-Chat, and Gemini-2.5-Pro on 100 ICLR 2025 submissions, using a controlled PDF text extraction workflow. A uniform ICLR scoring rubric on a 1 to 10 scale is used, with three dimensions assessed per paper, and each paper reviewed three times by the same reviewer model to obtain mean and standard deviation. A detection based defence is proposed that prompts the reviewer to identify potential injections. An adaptive adversarial attack is also explored to evade the defence.

Key Findings

Static attacks increase average scores by 1.91 for Gemini, 2.80 for DeepSeek and 1.24 for GPT-5, indicating model dependent vulnerability.
Iterative attacks push scores close to the maximum for Gemini and DeepSeek within up to three rounds, while GPT-5 shows more restrained increases.
Attack effectiveness is robust across insertion positions, varying human ratings, and paper lengths, and cross model transfer is possible, with prompts from GPT-5 showing strongest cross model transfer to Gemini and DeepSeek.
GPT-5 is comparatively more robust to external attack prompts, whereas Gemini and DeepSeek are more vulnerable to cross model prompts; cross model transfer aligns with the relative robustness of the target model.
Detection based defence can identify attack prompts in most cases and reduce scores toward non attacked baselines, but an adaptive attacker can evade detection, yielding substantial score inflation in some cases.
Adaptive attacks against defence increase average scores and many attacks remain undetected, revealing limitations of purely detection based strategies and the need for defence in depth.

Limitations

The study is limited to 100 papers and three AI reviewers due to the substantial cost of generating AI assisted reviews. Further work is needed to explore larger scales and additional reviewer models and real world deployment considerations.

Why It Matters

The work exposes vulnerabilities in AI assisted review pipelines that can bias evaluations, with potential consequences for funding, policy and trust in published work. It provides practical takeaways for prompt engineering and red teaming to test and strengthen AI aided evaluation systems, and emphasises the need for robust safeguards, auditing, input controls, prompt monitoring, model alignment and human in the loop checks to mitigate abuse.

Attribution Original paper on arXiv