Researchers Expose How Embedded Prompts Manipulate Reviews

Attacks

Published: Mon, Sep 15, 2025 • By Rowan Vale

Researchers Expose How Embedded Prompts Manipulate Reviews

New research shows language models used to help peer review can be steered by hidden instructions embedded inside submissions. Models inflate scores for weaker work and can be forced to suppress weaknesses. The study exposes a practical attack surface and urges urgent safeguards to stop manipulated, unreliable automated reviews.

A large language model is an AI that predicts and generates text by learning patterns from huge amounts of writing. Prompt injection is when hidden or embedded instructions in input trick that AI into following attacker-specified directions.

This study tests GPT-5-mini as a reviewer aid on 1,441 conference papers and finds concrete, exploitable problems. The model systematically inflates ratings for weaker papers, aligns better on top-tier work, and—critically—field-specific instructions hidden inside PDFs can push reviews toward perfect scores and erase reported weaknesses. Location in the document matters, with edges being most effective; repetition adds little. GPT-4o-mini shows similar or worse susceptibility in parallel tests.

Why readers should care: automated review tools are already entering workflows. If attackers or careless authors can nudge scores, the integrity of conferences and journals erodes. This is not theoretical hair-splitting; it is a practical attack surface that can distort careers and the scientific record.

Minimal viable controls (do these first):

Sanitize extracted text: strip hidden layers and nonvisible content before any model sees a submission.
Require human review for borderline or high-stakes decisions; never accept model-only ratings.
Log inputs and model versions, and run simple anomaly detectors for rating spikes.

Good-Better-Best options:

Good: calibration datasets and automated cross-checks with a second model.
Better: multi-model consensus, provenance-tagged submissions, and periodic red-team prompt-injection tests.
Best: signed, tamper-evident submission formats, mandated AI-assistance disclosure, and independent audits of model behavior.

Short version: treat these systems as assistants, not judges. Patch the input pipe, add human gates, and start adversarial testing today before someone corrupts peer review and calls it innovation.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review

Authors: Changjia Zhu, Junjie Xiong, Renkai Ma, Zhicong Lu, Yao Liu, and Lingyao Li

Peer review is the cornerstone of academic publishing, yet the process is increasingly strained by rising submission volumes, reviewer overload, and expertise mismatches. Large language models (LLMs) are now being used as "reviewer aids," raising concerns about their fairness, consistency, and robustness against indirect prompt injection attacks. This paper presents a systematic evaluation of LLMs as academic reviewers. Using a curated dataset of 1,441 papers from ICLR 2023 and NeurIPS 2022, we evaluate GPT-5-mini against human reviewers across ratings, strengths, and weaknesses. The evaluation employs structured prompting with reference paper calibration, topic modeling, and similarity analysis to compare review content. We further embed covert instructions into PDF submissions to assess LLMs' susceptibility to prompt injection. Our findings show that LLMs consistently inflate ratings for weaker papers while aligning more closely with human judgments on stronger contributions. Moreover, while overarching malicious prompts induce only minor shifts in topical focus, explicitly field-specific instructions successfully manipulate specific aspects of LLM-generated reviews. This study underscores both the promises and perils of integrating LLMs into peer review and points to the importance of designing safeguards that ensure integrity and trust in future review processes.

🔍 ShortSpan Analysis of the Paper

Problem

Peer review is central to academic publishing but faces strain from rising submissions, reviewer overload and expertise mismatches. Large language models are increasingly used as reviewer aids, raising concerns about fairness, consistency and robustness to prompt injection attacks. This paper conducts a systematic evaluation of LLMs as academic reviewers, using a curated dataset of 1,441 papers from ICLR 2023 and NeurIPS 2022, and compares GPT-5-mini with human reviewers across ratings, strengths and weaknesses, and robustness to embedded prompts.

Approach

The study creates three review sets: official human reviews, structured LLM generated reviews, and structured LLM reviews with prompt injection. It employs structured prompting with reference paper calibration, topic modelling and similarity analysis, and embeds covert instructions into papers to test prompt injection. The dataset comprises 991 ICLR and 450 NeurIPS papers. Reviews follow a JSON schema with strengths, weaknesses and rating; reference papers are used as calibration anchors to guide the LLM towards unbiased, consistent evaluations. Strengths and weaknesses are tokenised into bullet points for alignment. Topic modelling uses BERTopic with all mpnet base vectors, UMAP for dimensionality reduction and KMeans clustering yielding eighty topics, which are manually mapped into eight high level themes. An injection framework inserts malicious prompts into PDFs at four locations with varying repetition and includes field specific prompts that target rating inflation and suppression of weaknesses. The model used is GPT-5-mini; comparisons include GPT-4o-mini in injection experiments.

Key Findings

LLMs systematically inflate ratings for weaker papers while aligning more closely with human judgments on stronger ones; across the 1,441 papers the average human rating was 5.70 and the LLM average was 6.86, indicating inflation by the model.
Alignment between LLM and human ratings is stronger for higher quality papers, with 59.6 per cent of notable top tier papers showing close agreement; track level analysis shows higher alignment in General Machine Learning and Reinforcement Learning and wider discrepancies in Social Aspects of Machine Learning, Theory and Optimisation.
Humans emphasise novelty of study design and presentation clarity whereas LLMs prioritise empirical rigour and technical implementation; Jensen Shannon divergence between topic distributions is 0.031 for strengths and 0.043 for weaknesses, indicating moderated divergence in emphasis.
Prompt injection can manipulate LLM reviews; overarching prompts cause modest rating inflation but field specific embedded instructions can substantially bias outputs, including pushing ratings to the maximum and reducing reported weaknesses. Location in the document matters, with injections at document boundaries being most effective; repetition has limited additional impact. GPT-5-mini shows susceptibility with about one third of cases yielding a perfect score under certain field specific prompts; GPT-4o-mini shows greater susceptibility in the same prompts.

Limitations

The study focuses on GPT-5-mini and a dataset restricted to ICLR 2023 and NeurIPS 2022 within computer science venues; findings may not generalise to other models or disciplines. Only two comparison models are examined for prompt injection and long term effects and deployment constraints require further study.

Why It Matters

The work demonstrates concrete vulnerabilities in AI agents used for critical tasks such as peer review, including susceptibility to hidden or field specific prompts embedded in documents and the potential for biased or manipulated outcomes. It highlights practical attack surfaces and the need for safeguards such as calibration datasets, input validation, red teams, layered defenses, provenance tracking and model agnostic checks. The findings support framing LLMs as calibrated assistants rather than judges, with policy and design implications for responsible use, transparency, accountability and auditability in AI assisted peer review, including disclosure of AI assistance and governance of prompt injection risks.

Attribution Original paper on arXiv