Confirmation Bias Lets Malicious PRs Evade LLM Review
Pentesting
AI-assisted code review is moving from novelty to default in many teams. That raises a hard question: what happens when the model believes the story your pull request tells it? This paper examines confirmation bias in Large Language Model (LLM) code review and shows that a bit of confident framing can bury real vulnerabilities, with direct implications for software supply chains.
How the bias was measured
The authors run a careful controlled study on 250 real CVE vulnerability–patch pairs spanning roughly 497 files. They test four production-style models used in code tooling today: GPT-4o-mini, Claude 3.5 Haiku, Gemini 2.0 Flash, and DeepSeek V3. Each model reviews the same changes under five prompt framings: neutral, weakly or strongly suggesting a bug exists, and weakly or strongly suggesting the change is bug-free. Prompts request a structured yes or no and line-numbered evidence. Human validators then check the outputs.
The effect size is striking and asymmetric. When the change is framed as bug-free, detection rates fall by 16 to 93 percentage points depending on the model, almost entirely through extra false negatives. False positives change little. One example stands out: GPT-4o-mini goes from 97.2 percent detection under neutral framing to 3.6 percent under strong bug-free framing. Injection-style flaws are more easily masked by framing than some memory corruption issues, a pattern consistent with models leaning on high-level intent cues rather than following dataflow rigorously. The study also surfaces a precision paradox: manual validation shows modest precision under neutral framing (29.0 to 42.4 percent). Under bug-free framing, precision can look better simply because the model raises fewer flags, but coverage collapses and many vulnerabilities slip through.
Exploitation and fixes
The second study turns this from measurement to attack. The team reintroduces known, previously fixed vulnerabilities via adversarial pull requests whose descriptions present the change as a security improvement or an urgent functionality fix. Against an interactive assistant (GitHub Copilot) in synthetic repositories, a one-shot attack succeeds in 35.3 percent of cases (12 of 34 PRs). Against an autonomous review agent (Claude Code) operating with real project configurations and iterative refinement of the PR text, success jumps to 88.2 percent (15 of 17 PRs). In other words, the more the agent trusts the surrounding narrative and can be coaxed over multiple attempts, the easier it is to sneak a regression past it.
The good news is pragmatic debiasing helps. Instructing interactive assistants to ignore pull request metadata restores detection in all interactive cases tested. For autonomous agents, combining metadata redaction with explicit ignore-instructions brings back detection in about 94 percent of cases. That said, metadata often carries legitimate context for human reviewers, so blunt redaction has costs.
For pentesters and defenders, the takeaway is concrete. Adversarial framing is a viable path to bypass LLM review, especially in CI/CD pipelines that use autonomous agents. If your process auto-merges based on AI approval for untrusted contributors, you have a supply-chain risk. Basic controls exist: strip or gate AI access to PR titles and descriptions, add explicit instructions to discount metadata, and avoid granting AI the final say on merges. The paper’s results also argue for evaluation protocols that vary contextual framing as a first-class stress test, not an afterthought.
The work has limits. It focuses on cost-sensitive, widely deployed models rather than the largest frontier models, and it studies reintroductions of known flaws rather than novel exploits. Some metadata influence remains even after redaction and instructions. Those caveats do not blunt the central point: context nudges LLMs hard, and attackers can hold the tiller.
The interesting open questions are system-level. How can tools separate useful context from biasing narrative without blinding themselves? What combinations of static analysis, taint tracking, and constrained prompting reduce reliance on intent cues for classes like injection? This paper sets a high bar for experimental design in this area and gives teams enough evidence to revisit how they deploy AI in the review loop.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies whether confirmation bias affects vulnerability detection when Large Language Models (LLMs) are used for security code review and whether that weakness can be exploited in software supply-chain attacks. As projects adopt interactive assistants and autonomous review agents in CI/CD pipelines, biased reliance on contextual signals such as pull request metadata could allow malicious or reverted vulnerable code to bypass automated checks, undermining supply-chain security.
Approach
The authors run two complementary studies. Study 1 is a controlled experiment on 250 real-world CVE vulnerability–patch pairs (about 497 files) sampled from CrossVuln and evaluated across four production-representative models (GPT-4o-mini, Claude 3.5 Haiku, Gemini 2.0 Flash, DeepSeek V3) under five prompt framings: Neutral, Weak Bug, Strong Bug, Weak Bug-free, Strong Bug-free. Prompts request structured YES/NO outputs and line-numbered evidence; all responses were preserved and manually validated. Study 2 simulates adversarial pull requests that revert past fixes and craft PR metadata framing the change as a security improvement or urgent fix. Attacks are tested against an interactive assistant (GitHub Copilot) in synthetic repositories and against autonomous Claude Code review agents in real project configurations, including iterative refinement of PR descriptions. The authors also evaluate simple debiasing: redacting metadata and explicit instructions to ignore metadata.
Key Findings
- Confirmation bias is large and asymmetric: framing a change as bug-free reduces vulnerability detection rates by 16 to 93 percentage points across models, mainly by increasing false negatives while false positive rates change little. For example, GPT-4o-mini fell from 97.2% detection under neutral framing to 3.6% under strong bug-free framing (a 93.5 percentage-point drop).
- Detection quality and calibration change under bias: manual validation shows modest precision under neutral framing (29.0% to 42.4%), and as models produce fewer detections under bug-free framing precision often rises but coverage collapses, producing a precision paradox where higher precision hides many missed vulnerabilities.
- Bias varies by vulnerability class: effects are heterogeneous across CWE Top 25 types and languages, with injection-related flaws generally more susceptible to framing than some memory-corruption issues, and failure modes include taint-unaware assumptions and misinterpretation of defensive patterns.
- Practical exploitability demonstrated: adversarial PR framing reintroducing known vulnerabilities succeeds in 35.3% of cases against GitHub Copilot in one-shot attacks (12 of 34 synthetic PRs) and in 88.2% of cases against Claude Code in real-project, iteratively refined attacks (15 of 17 PRs).
- Simple debiasing is effective: instructing interactive assistants to ignore PR metadata restored detection in all interactive cases; combining metadata redaction with explicit ignore-instructions recovered detection in about 94% of autonomous-agent cases.
Limitations
Experiments used specific models and dataset slices representative of deployed, cost-sensitive models rather than the highest-capability models. Vulnerability reintroductions rather than novel exploit creation were studied for ethical safety. Some debiasing still left residual metadata influence and practical deployment trade-offs exist because metadata can be useful context for legitimate reviewers.
Why It Matters
The work exposes a systematic, exploitable failure mode in LLM-assisted code review that can enable supply-chain attacks by framing malicious changes as benign or security improvements. Defenders should avoid over-reliance on automated review without debiasing controls, consider metadata hygiene and explicit ignore-instructions, restrict automatic merging for untrusted contributors, and pursue system-level mitigations in model prompting and deployment to reduce confirmation bias in security-critical workflows.