Hidden Prompt Injections Hijack LLM Peer Review

Attacks

Published: Fri, Aug 29, 2025 • By James Armitage

Hidden Prompt Injections Hijack LLM Peer Review

New research shows hidden prompt injections embedded inside paper PDFs can steer large language model (LLM) reviews without human notice. Authors demonstrate attacks that reliably bias automated reviews across commercial systems, expose detection gaps, and test defenses. The work highlights risks to scholarly integrity and urges governance that pairs policy with practical controls.

New work demonstrates a practical attack: authors hide short, adversarial prompts inside PDF manuscripts that remain invisible to human readers but steer large language model (LLM) generated reviews. Prompt injection means adding instructions into input material so the model follows them; LLMs are the automated systems that write summaries or assessments. The most concerning finding is that these injections reliably bias reviews across multiple commercial systems and can fool honest-but-lazy reviewers who lean on automated outputs.

This matters because journals and conferences are experimenting with LLM-assisted review to speed triage. When the model effectively acts as a gatekeeper, subtle manipulations can distort judgments about novelty, methodology, or safety. The paper also shows an arms race: simple keyword detectors and blocklists fail against obfuscation, and attacks often transfer across models, so piecemeal vendor fixes are not enough.

Policy and governance intersect with controls in direct, practical ways. Requiring provenance (declare when AI helped produce a review), standard robustness tests, and audited toolchains raises the bar for attackers. But every control carries trade-offs: stricter checks add time and cost, provenance rules can be gamed, and aggressive filtering risks false positives that penalize legitimate submissions. Compliance without substance is performative; real safety pairs governance with technical hygiene like prompt sanitation, input normalization, and adversarial testing.

Concrete steps to take this quarter: run threat-model tabletop exercises, enable basic prompt-scrubbing on uploaded PDFs, and flag reviews that closely mirror submitted text. Later actions: require provenance reporting, commission independent robustness audits of vendor LLMs, and develop standardized prompt-injection resistance tests across platforms.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Publish to Perish: Prompt Injection Attacks on LLM-Assisted Peer Review

Authors: Matteo Gioele Collu, Umberto Salviati, Roberto Confalonieri, Mauro Conti, and Giovanni Apruzzese

Large Language Models (LLMs) are increasingly being integrated into the scientific peer-review process, raising new questions about their reliability and resilience to manipulation. In this work, we investigate the potential for hidden prompt injection attacks, where authors embed adversarial text within a paper's PDF to influence the LLM-generated review. We begin by formalising three distinct threat models that envision attackers with different motivations -- not all of which implying malicious intent. For each threat model, we design adversarial prompts that remain invisible to human readers yet can steer an LLM's output toward the author's desired outcome. Using a user study with domain scholars, we derive four representative reviewing prompts used to elicit peer reviews from LLMs. We then evaluate the robustness of our adversarial prompts across (i) different reviewing prompts, (ii) different commercial LLM-based systems, and (iii) different peer-reviewed papers. Our results show that adversarial prompts can reliably mislead the LLM, sometimes in ways that adversely affect a "honest-but-lazy" reviewer. Finally, we propose and empirically assess methods to reduce detectability of adversarial prompts under automated content checks.

🔍 ShortSpan Analysis of the Paper

Authors

Matteo Gioele Collu, University of Padua, Italy; Umberto Salviati, University of Padua, Italy; Roberto Confalonieri, University of Padua, Italy; Mauro Conti, University of Padua, Italy and Orebro University, Sweden; Giovanni Apruzzese, University of Liechtenstein, Liechtenstein.

Problem

The study investigates the vulnerability of large language model assisted peer review to hidden prompt injection attacks, where authors embed adversarial text within a paper s PDF to influence the LLM generated review, raising concerns about reliability, integrity and governance in AI assisted scholarly review.

Approach

The authors formalise three threat models representing different attacker motives Ignore Detect and Exploit. They design adversarial prompts that remain invisible to human readers and can steer an LLM s output. A user study with domain scholars yields four representative reviewing prompts used to elicit peer reviews from LLMs. They evaluate robustness across different reviewing prompts, commercial LLM based systems and a range of peer reviewed papers. The experimental setup uses papers rejected from ICLR, a toolkit to inject prompts into PDFs, and services that interact with PDFs such as GPT based tools, Gemini and Claude. They also test existing prompts found in the wild and assess countermeasures including obfuscation techniques and potential system level changes. Resources including a dataset of about nine thousand LLM generated reviews are released in a repository.

Key Findings

Adversarial prompts can reliably mislead LLMs in the peer review setting, sometimes affecting an honest but lazy reviewer.
Ignore adversarial prompts are always successful across most tested LLMs, with some variation in performance for Gemini and a strong effect for Claude Sonnet 4.
Exploit prompts consistently raise review scores above baseline, with Exploit 1 and Exploit 2 producing very high ratings in many cases.
Detect prompts show mixed effectiveness; Detect 1 attains higher success rates than Detect 2, and success is influenced by the paper content and domain alignment.
Transfer of the attacks across GPT based models is common; GPT o3 generally yields similar results with different baselines, while GPT 4o shows substantial but smaller effects.
Attacks against Gemini 2.5 flash and Claude Sonnet 4 vary by system and prompt design; Exploit prompts remain effective, whereas Ignore is less effective on Gemini but highly effective on Claude.
Injecting multiple adversarial prompts or using chat markup tags strengthens effectiveness; removing the tags reduces detectability and combining prompts boosts success rates further.
Tests against in the wild prompts show limited impact for short prompts, whereas a very long prompt can have a measurable but weaker effect than targeted Exploit prompts.
Defences such as automated checks, robustness testing and prompt hygiene are explored; obfuscation techniques can defeat basic keyword based detection, indicating a need for governance and systemic safeguards.

Limitations

The study relies on closed source LLMs in a black box setting and results may not generalise to all models. It uses a limited sample of papers and is constrained by practical resources, which may affect scalability and generalisability. The experiments occurred between March and August 2025 and may not reflect future model updates. Appendix includes negative results that inform boundary conditions.

Why It Matters

The work highlights a realistic risk to the integrity of AI assisted peer review where hidden prompts embedded in manuscripts can steer reviews without human readers noticing. The findings have implications for policy, governance and detection strategies in scholarly publishing and other AI assisted workflows. The authors discuss practical countermeasures such as automated checks, robustness testing and prompt hygiene to harden review processes and reduce prompt leakage. A dataset of approximately nine thousand LLM generated reviews is released to support defence research and detector evaluation.

Attribution Original paper on arXiv