ShortSpan.ai logo

Confirmation Bias Lets Malicious PRs Evade LLM Review

Pentesting
Published: Fri, Mar 20, 2026 • By Marcus Halden
Confirmation Bias Lets Malicious PRs Evade LLM Review
New research finds Large Language Model (LLM) code reviewers are highly sensitive to pull request framing. Describe a change as bug-free and detection rates drop by 16–93 percentage points, mostly through missed vulnerabilities. Adversarial wording bypasses GitHub Copilot and Claude Code, though simple debiasing steps largely restore performance.

AI-assisted code review is moving from novelty to default in many teams. That raises a hard question: what happens when the model believes the story your pull request tells it? This paper examines confirmation bias in Large Language Model (LLM) code review and shows that a bit of confident framing can bury real vulnerabilities, with direct implications for software supply chains.

How the bias was measured

The authors run a careful controlled study on 250 real CVE vulnerability–patch pairs spanning roughly 497 files. They test four production-style models used in code tooling today: GPT-4o-mini, Claude 3.5 Haiku, Gemini 2.0 Flash, and DeepSeek V3. Each model reviews the same changes under five prompt framings: neutral, weakly or strongly suggesting a bug exists, and weakly or strongly suggesting the change is bug-free. Prompts request a structured yes or no and line-numbered evidence. Human validators then check the outputs.

The effect size is striking and asymmetric. When the change is framed as bug-free, detection rates fall by 16 to 93 percentage points depending on the model, almost entirely through extra false negatives. False positives change little. One example stands out: GPT-4o-mini goes from 97.2 percent detection under neutral framing to 3.6 percent under strong bug-free framing. Injection-style flaws are more easily masked by framing than some memory corruption issues, a pattern consistent with models leaning on high-level intent cues rather than following dataflow rigorously. The study also surfaces a precision paradox: manual validation shows modest precision under neutral framing (29.0 to 42.4 percent). Under bug-free framing, precision can look better simply because the model raises fewer flags, but coverage collapses and many vulnerabilities slip through.

Exploitation and fixes

The second study turns this from measurement to attack. The team reintroduces known, previously fixed vulnerabilities via adversarial pull requests whose descriptions present the change as a security improvement or an urgent functionality fix. Against an interactive assistant (GitHub Copilot) in synthetic repositories, a one-shot attack succeeds in 35.3 percent of cases (12 of 34 PRs). Against an autonomous review agent (Claude Code) operating with real project configurations and iterative refinement of the PR text, success jumps to 88.2 percent (15 of 17 PRs). In other words, the more the agent trusts the surrounding narrative and can be coaxed over multiple attempts, the easier it is to sneak a regression past it.

The good news is pragmatic debiasing helps. Instructing interactive assistants to ignore pull request metadata restores detection in all interactive cases tested. For autonomous agents, combining metadata redaction with explicit ignore-instructions brings back detection in about 94 percent of cases. That said, metadata often carries legitimate context for human reviewers, so blunt redaction has costs.

For pentesters and defenders, the takeaway is concrete. Adversarial framing is a viable path to bypass LLM review, especially in CI/CD pipelines that use autonomous agents. If your process auto-merges based on AI approval for untrusted contributors, you have a supply-chain risk. Basic controls exist: strip or gate AI access to PR titles and descriptions, add explicit instructions to discount metadata, and avoid granting AI the final say on merges. The paper’s results also argue for evaluation protocols that vary contextual framing as a first-class stress test, not an afterthought.

The work has limits. It focuses on cost-sensitive, widely deployed models rather than the largest frontier models, and it studies reintroductions of known flaws rather than novel exploits. Some metadata influence remains even after redaction and instructions. Those caveats do not blunt the central point: context nudges LLMs hard, and attackers can hold the tiller.

The interesting open questions are system-level. How can tools separate useful context from biasing narrative without blinding themselves? What combinations of static analysis, taint tracking, and constrained prompting reduce reliance on intent cues for classes like injection? This paper sets a high bar for experimental design in this area and gives teams enough evidence to revisit how they deploy AI in the review loop.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review

Authors: Dimitris Mitropoulos, Nikolaos Alexopoulos, Georgios Alexopoulos, and Diomidis Spinellis
Security code reviews increasingly rely on systems integrating Large Language Models (LLMs), ranging from interactive assistants to autonomous agents in CI/CD pipelines. We study whether confirmation bias (i.e., the tendency to favor interpretations that align with prior expectations) affects LLM-based vulnerability detection, and whether this failure mode can be exploited in software supply-chain attacks. We conduct two complementary studies. Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt. Framing a change as bug-free reduces vulnerability detection rates by 16-93%, with strongly asymmetric effects: false negatives increase sharply while false positive rates change little. Bias effects vary by vulnerability type, with injection flaws being more susceptible to them than memory corruption bugs. Study 2 evaluates exploitability in practice mimicking adversarial pull requests that reintroduce known vulnerabilities while framed as security improvements or urgent functionality fixes via their pull request metadata. Adversarial framing succeeds in 35% of cases against GitHub Copilot (interactive assistant) under one-shot attacks and in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing to increase attack success. Debiasing via metadata redaction and explicit instructions restores detection in all interactive cases and 94% of autonomous cases. Our results show that confirmation bias poses a weakness in LLM-based code review, with implications on how AI-assisted development tools are deployed.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies whether confirmation bias affects vulnerability detection when Large Language Models (LLMs) are used for security code review and whether that weakness can be exploited in software supply-chain attacks. As projects adopt interactive assistants and autonomous review agents in CI/CD pipelines, biased reliance on contextual signals such as pull request metadata could allow malicious or reverted vulnerable code to bypass automated checks, undermining supply-chain security.

Approach

The authors run two complementary studies. Study 1 is a controlled experiment on 250 real-world CVE vulnerability–patch pairs (about 497 files) sampled from CrossVuln and evaluated across four production-representative models (GPT-4o-mini, Claude 3.5 Haiku, Gemini 2.0 Flash, DeepSeek V3) under five prompt framings: Neutral, Weak Bug, Strong Bug, Weak Bug-free, Strong Bug-free. Prompts request structured YES/NO outputs and line-numbered evidence; all responses were preserved and manually validated. Study 2 simulates adversarial pull requests that revert past fixes and craft PR metadata framing the change as a security improvement or urgent fix. Attacks are tested against an interactive assistant (GitHub Copilot) in synthetic repositories and against autonomous Claude Code review agents in real project configurations, including iterative refinement of PR descriptions. The authors also evaluate simple debiasing: redacting metadata and explicit instructions to ignore metadata.

Key Findings

  • Confirmation bias is large and asymmetric: framing a change as bug-free reduces vulnerability detection rates by 16 to 93 percentage points across models, mainly by increasing false negatives while false positive rates change little. For example, GPT-4o-mini fell from 97.2% detection under neutral framing to 3.6% under strong bug-free framing (a 93.5 percentage-point drop).
  • Detection quality and calibration change under bias: manual validation shows modest precision under neutral framing (29.0% to 42.4%), and as models produce fewer detections under bug-free framing precision often rises but coverage collapses, producing a precision paradox where higher precision hides many missed vulnerabilities.
  • Bias varies by vulnerability class: effects are heterogeneous across CWE Top 25 types and languages, with injection-related flaws generally more susceptible to framing than some memory-corruption issues, and failure modes include taint-unaware assumptions and misinterpretation of defensive patterns.
  • Practical exploitability demonstrated: adversarial PR framing reintroducing known vulnerabilities succeeds in 35.3% of cases against GitHub Copilot in one-shot attacks (12 of 34 synthetic PRs) and in 88.2% of cases against Claude Code in real-project, iteratively refined attacks (15 of 17 PRs).
  • Simple debiasing is effective: instructing interactive assistants to ignore PR metadata restored detection in all interactive cases; combining metadata redaction with explicit ignore-instructions recovered detection in about 94% of autonomous-agent cases.

Limitations

Experiments used specific models and dataset slices representative of deployed, cost-sensitive models rather than the highest-capability models. Vulnerability reintroductions rather than novel exploit creation were studied for ethical safety. Some debiasing still left residual metadata influence and practical deployment trade-offs exist because metadata can be useful context for legitimate reviewers.

Why It Matters

The work exposes a systematic, exploitable failure mode in LLM-assisted code review that can enable supply-chain attacks by framing malicious changes as benign or security improvements. Defenders should avoid over-reliance on automated review without debiasing controls, consider metadata hygiene and explicit ignore-instructions, restrict automatic merging for untrusted contributors, and pursue system-level mitigations in model prompting and deployment to reduce confirmation bias in security-critical workflows.


Related Articles

Related Research on arXiv

Get the Weekly AI Security Digest

Top research and analysis delivered to your inbox every week. No spam, unsubscribe anytime.