Hidden Prompt Injections Hijack LLM Peer Review
Attacks
New work demonstrates a practical attack: authors hide short, adversarial prompts inside PDF manuscripts that remain invisible to human readers but steer large language model (LLM) generated reviews. Prompt injection means adding instructions into input material so the model follows them; LLMs are the automated systems that write summaries or assessments. The most concerning finding is that these injections reliably bias reviews across multiple commercial systems and can fool honest-but-lazy reviewers who lean on automated outputs.
This matters because journals and conferences are experimenting with LLM-assisted review to speed triage. When the model effectively acts as a gatekeeper, subtle manipulations can distort judgments about novelty, methodology, or safety. The paper also shows an arms race: simple keyword detectors and blocklists fail against obfuscation, and attacks often transfer across models, so piecemeal vendor fixes are not enough.
Policy and governance intersect with controls in direct, practical ways. Requiring provenance (declare when AI helped produce a review), standard robustness tests, and audited toolchains raises the bar for attackers. But every control carries trade-offs: stricter checks add time and cost, provenance rules can be gamed, and aggressive filtering risks false positives that penalize legitimate submissions. Compliance without substance is performative; real safety pairs governance with technical hygiene like prompt sanitation, input normalization, and adversarial testing.
Concrete steps to take this quarter: run threat-model tabletop exercises, enable basic prompt-scrubbing on uploaded PDFs, and flag reviews that closely mirror submitted text. Later actions: require provenance reporting, commission independent robustness audits of vendor LLMs, and develop standardized prompt-injection resistance tests across platforms.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Publish to Perish: Prompt Injection Attacks on LLM-Assisted Peer Review
🔍 ShortSpan Analysis of the Paper
Problem
The study investigates the vulnerability of large language model assisted peer review to hidden prompt injection attacks, where authors embed adversarial text within a paper s PDF to influence the LLM generated review, raising concerns about reliability, integrity and governance in AI assisted scholarly review.
Approach
The authors formalise three threat models representing different attacker motives Ignore Detect and Exploit. They design adversarial prompts that remain invisible to human readers and can steer an LLM s output. A user study with domain scholars yields four representative reviewing prompts used to elicit peer reviews from LLMs. They evaluate robustness across different reviewing prompts, commercial LLM based systems and a range of peer reviewed papers. The experimental setup uses papers rejected from ICLR, a toolkit to inject prompts into PDFs, and services that interact with PDFs such as GPT based tools, Gemini and Claude. They also test existing prompts found in the wild and assess countermeasures including obfuscation techniques and potential system level changes. Resources including a dataset of about nine thousand LLM generated reviews are released in a repository.
Key Findings
- Adversarial prompts can reliably mislead LLMs in the peer review setting, sometimes affecting an honest but lazy reviewer.
- Ignore adversarial prompts are always successful across most tested LLMs, with some variation in performance for Gemini and a strong effect for Claude Sonnet 4.
- Exploit prompts consistently raise review scores above baseline, with Exploit 1 and Exploit 2 producing very high ratings in many cases.
- Detect prompts show mixed effectiveness; Detect 1 attains higher success rates than Detect 2, and success is influenced by the paper content and domain alignment.
- Transfer of the attacks across GPT based models is common; GPT o3 generally yields similar results with different baselines, while GPT 4o shows substantial but smaller effects.
- Attacks against Gemini 2.5 flash and Claude Sonnet 4 vary by system and prompt design; Exploit prompts remain effective, whereas Ignore is less effective on Gemini but highly effective on Claude.
- Injecting multiple adversarial prompts or using chat markup tags strengthens effectiveness; removing the tags reduces detectability and combining prompts boosts success rates further.
- Tests against in the wild prompts show limited impact for short prompts, whereas a very long prompt can have a measurable but weaker effect than targeted Exploit prompts.
- Defences such as automated checks, robustness testing and prompt hygiene are explored; obfuscation techniques can defeat basic keyword based detection, indicating a need for governance and systemic safeguards.
Limitations
The study relies on closed source LLMs in a black box setting and results may not generalise to all models. It uses a limited sample of papers and is constrained by practical resources, which may affect scalability and generalisability. The experiments occurred between March and August 2025 and may not reflect future model updates. Appendix includes negative results that inform boundary conditions.
Why It Matters
The work highlights a realistic risk to the integrity of AI assisted peer review where hidden prompts embedded in manuscripts can steer reviews without human readers noticing. The findings have implications for policy, governance and detection strategies in scholarly publishing and other AI assisted workflows. The authors discuss practical countermeasures such as automated checks, robustness testing and prompt hygiene to harden review processes and reduce prompt leakage. A dataset of approximately nine thousand LLM generated reviews is released to support defence research and detector evaluation.