Researchers Expose How Embedded Prompts Manipulate Reviews
Attacks
A large language model is an AI system that predicts and generates text by learning patterns from huge amounts of writing. Prompt injection occurs when hidden or embedded instructions in the input trick that AI into following attacker-specified directions.
This study tests GPT-5-mini as a reviewer aid on 1,441 conference papers and finds concrete, exploitable problems. The model systematically inflates ratings for weaker papers and aligns more closely with human judgments on top-tier work; critically, field-specific instructions hidden inside PDFs can push reviews toward perfect scores and erase reported weaknesses. Location in the document matters, with the edges being most effective; repetition adds little. GPT-4o-mini proves even more susceptible in parallel tests.
Why readers should care: automated review tools are already entering workflows. If attackers or careless authors can nudge scores, the integrity of conferences and journals erodes. This is not theoretical hair-splitting; it is a practical attack surface that can distort careers and the scientific record.
Minimal viable controls (do these first):
- Sanitize extracted text: strip hidden layers and nonvisible content before any model sees a submission (see the sanitization sketch after this list).
- Require human review for borderline or high-stakes decisions; never accept model-only ratings.
- Log inputs and model versions, and run simple anomaly detectors for rating spikes.
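The sanitization step can start simple. The sketch below is a minimal example, assuming submissions arrive as PDFs and that pypdf is available for extraction: it strips non-printing characters commonly used to hide payloads and flags injection-like phrases for human routing. The pattern list and file name are illustrative, not a complete defence; catching white-on-white or off-canvas text additionally requires rendering-aware checks.

```python
# Minimal input-sanitization sketch (assumption: pypdf is available and the
# submission arrives as a PDF). Real deployments should also render pages and
# diff rendered vs. extracted text to catch visually hidden content.
import re
import unicodedata

from pypdf import PdfReader

# Phrases that commonly appear in embedded prompt-injection payloads
# (illustrative list, not exhaustive).
SUSPICIOUS_PATTERNS = [
    r"ignore (all|any) (previous|prior) instructions",
    r"give (this paper|it) (a|the) (maximum|perfect) (score|rating)",
    r"do not (mention|report) (any )?weaknesses",
]


def strip_invisible(text: str) -> str:
    """Remove zero-width and other non-printing characters often used to hide payloads."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("Cf", "Cc") or ch in "\n\t"
    )


def sanitize_submission(pdf_path: str) -> tuple[str, list[str]]:
    """Extract text, drop invisible characters, and flag injection-like phrases."""
    reader = PdfReader(pdf_path)
    raw = "\n".join(page.extract_text() or "" for page in reader.pages)
    clean = strip_invisible(raw)
    flags = [p for p in SUSPICIOUS_PATTERNS if re.search(p, clean, re.IGNORECASE)]
    return clean, flags


if __name__ == "__main__":
    text, flags = sanitize_submission("submission.pdf")  # hypothetical file name
    if flags:
        print("Route to human review; suspicious patterns:", flags)
```

Anything flagged here should be routed to a human rather than silently dropped, so reviewers can see what the attacker attempted.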
Good-Better-Best options:
- Good: calibration datasets and automated cross-checks with a second model.
- Better: multi-model consensus (see the consensus sketch after this list), provenance-tagged submissions, and periodic red-team prompt-injection tests.
- Best: signed, tamper-evident submission formats, mandated AI-assistance disclosure, and independent audits of model behavior.
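As a sketch of what the "Better" tier's multi-model consensus could look like: the function below accepts a rating only when independent reviewer models agree within a tolerance and otherwise escalates to a human. The reviewer callables, the spread threshold, and the escalation message are hypothetical placeholders, not anything from the paper.

```python
# Minimal consensus-gate sketch. The reviewer callables are placeholders
# (hypothetical functions); in practice they would wrap independent models,
# or one model run over independently sanitized copies of the input.
from statistics import mean
from typing import Callable


def consensus_rating(
    paper_text: str,
    reviewers: list[Callable[[str], float]],
    max_spread: float = 1.0,
) -> tuple[float | None, str]:
    """Return a rating only when independent reviewers agree; otherwise escalate."""
    ratings = [review(paper_text) for review in reviewers]
    spread = max(ratings) - min(ratings)
    if spread > max_spread:
        return None, f"escalate_to_human (ratings {ratings}, spread {spread:.1f})"
    return mean(ratings), "accepted"
```

The design choice is deliberate: disagreement is treated as a signal for human attention, never averaged away, because an injected prompt typically skews one model far more than another.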
Short version: treat these systems as assistants, not judges. Patch the input pipeline, add human gates, and start adversarial testing today before someone corrupts peer review and calls it innovation.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review
🔍 ShortSpan Analysis of the Paper
Problem
Peer review is central to academic publishing but faces strain from rising submissions, reviewer overload and expertise mismatches. Large language models are increasingly used as reviewer aids, raising concerns about fairness, consistency and robustness to prompt injection attacks. This paper conducts a systematic evaluation of LLMs as academic reviewers, using a curated dataset of 1,441 papers from ICLR 2023 and NeurIPS 2022, and compares GPT-5-mini with human reviewers across ratings, strengths and weaknesses, and robustness to embedded prompts.
Approach
The study creates three review sets: official human reviews, structured LLM-generated reviews, and structured LLM reviews with prompt injection. It employs structured prompting with reference-paper calibration, topic modelling and similarity analysis, and embeds covert instructions into papers to test prompt injection. The dataset comprises 991 ICLR and 450 NeurIPS papers. Reviews follow a JSON schema with strengths, weaknesses and rating; reference papers are used as calibration anchors to guide the LLM towards unbiased, consistent evaluations. Strengths and weaknesses are tokenised into bullet points for alignment. Topic modelling uses BERTopic with all-mpnet-base-v2 sentence embeddings, UMAP for dimensionality reduction and KMeans clustering yielding eighty topics, which are manually mapped into eight high-level themes. An injection framework inserts malicious prompts into PDFs at four locations with varying repetition and includes field-specific prompts that target rating inflation and suppression of weaknesses. The model used is GPT-5-mini; comparisons include GPT-4o-mini in injection experiments.
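A minimal sketch of the described topic-modelling stage (BERTopic over all-mpnet-base-v2 embeddings, UMAP, KMeans with eighty clusters) might look like the following; apart from the cluster count and the embedding model, the parameter values and the function name are assumptions rather than the authors' exact configuration.

```python
# Sketch of the described topic-modelling pipeline; hyperparameters other than
# n_topics=80 and the embedding model are illustrative assumptions.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from umap import UMAP


def build_topic_model(review_bullets: list[str], n_topics: int = 80) -> BERTopic:
    """Fit BERTopic over tokenised strength/weakness bullets from the reviews."""
    embedding_model = SentenceTransformer("all-mpnet-base-v2")
    umap_model = UMAP(n_components=5, random_state=42)
    # BERTopic accepts a scikit-learn clusterer in place of its default HDBSCAN.
    cluster_model = KMeans(n_clusters=n_topics, random_state=42)

    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=cluster_model,
    )
    topic_model.fit(review_bullets)
    return topic_model
```

The eighty fitted topics would then be manually grouped into the eight high-level themes reported in the paper.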
Key Findings
- LLMs systematically inflate ratings for weaker papers while aligning more closely with human judgments on stronger ones; across the 1,441 papers the average human rating was 5.70 and the LLM average was 6.86, an average inflation of 1.16 points.
- Alignment between LLM and human ratings is stronger for higher-quality papers, with 59.6 per cent of notable top-tier papers showing close agreement; track-level analysis shows higher alignment in General Machine Learning and Reinforcement Learning and wider discrepancies in Social Aspects of Machine Learning, Theory and Optimisation.
- Humans emphasise novelty of study design and presentation clarity whereas LLMs prioritise empirical rigour and technical implementation; Jensen-Shannon divergence between topic distributions is 0.031 for strengths and 0.043 for weaknesses, indicating moderate divergence in emphasis (a short computation sketch follows this list).
- Prompt injection can manipulate LLM reviews; overarching prompts cause modest rating inflation, but field-specific embedded instructions can substantially bias outputs, including pushing ratings to the maximum and reducing reported weaknesses. Location in the document matters, with injections at document boundaries being most effective; repetition has limited additional impact. GPT-5-mini is susceptible, with about one third of cases yielding a perfect score under certain field-specific prompts; GPT-4o-mini shows greater susceptibility to the same prompts.
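For reference, the Jensen-Shannon divergence quoted above can be computed directly from two topic-share distributions. The sketch below uses made-up example vectors over eight themes (not the paper's data), and the logarithm base is an assumption, since it is not stated here.

```python
# Minimal Jensen-Shannon divergence computation over two topic distributions
# (illustrative values, not the paper's actual distributions).
import numpy as np
from scipy.stats import entropy


def js_divergence(p, q, base=2):
    """JSD(P || Q) = 0.5*KL(P || M) + 0.5*KL(Q || M), with M = (P + Q) / 2."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m, base=base) + 0.5 * entropy(q, m, base=base)


# Example: topic-share vectors over eight high-level themes for human vs LLM reviews.
human = [0.20, 0.15, 0.10, 0.15, 0.10, 0.10, 0.10, 0.10]
llm = [0.15, 0.10, 0.20, 0.15, 0.10, 0.10, 0.12, 0.08]
print(round(js_divergence(human, llm), 3))  # small values indicate similar emphasis
```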
Limitations
The study focuses on GPT-5-mini and a dataset restricted to ICLR 2023 and NeurIPS 2022 within computer science venues; findings may not generalise to other models or disciplines. Only two models are compared in the prompt-injection experiments, and long-term effects and deployment constraints require further study.
Why It Matters
The work demonstrates concrete vulnerabilities in AI agents used for critical tasks such as peer review, including susceptibility to hidden or field-specific prompts embedded in documents and the potential for biased or manipulated outcomes. It highlights practical attack surfaces and the need for safeguards such as calibration datasets, input validation, red-teaming, layered defences, provenance tracking and model-agnostic checks. The findings support framing LLMs as calibrated assistants rather than judges, with policy and design implications for responsible use, transparency, accountability and auditability in AI-assisted peer review, including disclosure of AI assistance and governance of prompt-injection risks.