New to ShortSpan? We distil the AI-security research that matters into practitioner takeaways — edited by Ben Williams (NCC Group). Get the weekly email

// Analysis

Attackers hijack SOC copilots via log prompt injection

Published: Mon, May 25, 2026 • By James Armitage

Agents

Attackers hijack SOC copilots via log prompt injection

New research shows Large Language Model (LLM) copilots in Security Operations Centres can be steered by attacker-controlled log fields. Direct “ignore” overrides fail, but persona hijacks and context tricks work, especially in summarisation. Defences help but leave risk. Simulated analysts mispredict real behaviour, underscoring the need for live, adversarial testing.

Security teams keep wiring Large Language Models (LLMs) into their pipelines and then act surprised when the model believes what an attacker writes. This paper nails the core issue: the log fields you feed a copilot are often attacker controlled. User agents, URLs, payloads, DNS queries, even attempted usernames are not context; they are an input channel. Treat them otherwise and you hand the adversary a steering wheel.

How the attack works

The authors test four log-substrate prompt injection styles. Direct overrides (the clichéd “ignore all instructions”) flop completely here: 0% suppression in classification across conditions with gpt-4o-mini. That’s not the win defenders think it is. Persona hijacks succeed where overrides fail: under naive prompting they suppress 68% of malicious logs, and they still bite through stronger defences, leaving roughly a fifth of bad events suppressed even with the tightest setup reported. Context manipulation is the real car crash. In summarisation tasks, injection success hits 96% without defences and still lands at 38% with constrained output. That means the model rewrites the story of the incident to omit or soften the attack. Remediation is not immune either, with unsafe advice appearing up to 39% of the time in some naive conditions. Obfuscated payloads (think base64-style prefixed instructions) get nowhere in this study, which is the only comfort on offer.

Defences help, but not enough. Average injection success across tasks drops from 26.6% with naive prompting to 11.8% under the strongest defence. That residual is not academic when your pager is on fire. And the riskiest surface is not the classification label everyone obsesses over; it’s the summary that shapes the human responder’s next move.

The uncomfortable bit

Most SOCs treat logs as ground truth and the LLM as a clever reader. This flips that assumption. The attacker gets to write in the margin, and the model is credulous. The study also punctures a comfortable crutch: simulation. A deterministic mock analyst predicted direct overrides would work (92% suppression, naive) when the real model delivered 0%. Correlation between the mock and the live model was weak across tasks. If your risk assessment rests on a toy analyst, you are measuring the toy.

My view: plugging an LLM straight into SIEM logs without isolating untrusted fields is operational negligence. The numbers here are not edge cases; they are what happens when you let adversaries speak inside your tooling. Until models reliably separate “instructions” from “evidence”, treat SOC copilots as bright interns who can be socially engineered by a log line. If that sounds intolerable for your workflow, that’s the point.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Poisoning the Watchtower: Prompt Injection Attacks Against LLM-Augmented Security Operations Through Adversarial Log Content

Authors: Rohan Pandey and Archit Bhujang

Large language models (LLMs) are increasingly used as analyst assistants in security operations centers (SOCs), where they ingest log and alert data to produce triage labels, incident summaries, or remediation advice. We study a structural failure mode of this design: many log fields are attacker controlled. User agents, URLs, payloads, DNS queries, and attempted usernames can therefore carry instructions to the model alongside evidence of the intrusion. We call this setting \emph{log-substrate prompt injection}. We introduce a four-class taxonomy of log-substrate attacks: direct override (S1), persona hijack (S2), context manipulation (S3), and obfuscated payloads (S4). We evaluate 48 strategy-defense-task combinations using \texttt{gpt-4o-mini} as the analyst. Three findings stand out. First, direct overrides are ineffective in our setting: all S1 classification attacks achieve 0\% suppression. In contrast, persona hijacks suppress 68\% of malicious logs under a naive classifier and remain effective under stronger defenses. Second, summarization is the highest-risk task: context manipulation reaches 96\% injection success without defenses and 38\% even with constrained output. Third, defenses reduce but do not eliminate the attack surface: average injection success falls from 26.6\% under naive prompting to 11.8\% under our strongest defense. We also compare empirical results to a deterministic mock analyst and find that simulation substantially mispredicts current model behavior, especially for direct overrides. These results suggest that SOC copilots should treat raw log content as adversarial input rather than ordinary analyst context.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies a structural vulnerability in security operation centre pipelines that use large language models as analyst assistants. Many log fields are attacker controlled (user agents, URLs, payloads, DNS queries, usernames) and can therefore carry instructions to the model alongside evidence of an intrusion. That setting, called log-substrate prompt injection, risks corrupting triage labels, incident summaries or remediation advice and so can materially delay or prevent containment.

Approach

The authors define four attack classes (S1 direct override, S2 persona hijack, S3 context manipulation, S4 obfuscated payload) and evaluate them against an LLM analyst (gpt-4o-mini) across three tasks (classification, summarisation, remediation) and four simple defence modes (naive, structured prompting, input sanitisation, constrained output). They synthesize logs using field and scenario schemas based on CIC-IDS2017 and UNSW-NB15, generate 200 examples per condition with a 60/40 malicious/benign mix, inject payloads into attacker-controlled fields, and score outcomes using task-appropriate metrics: suppression rate for classification, injection success rate for summarisation, and unsafe recommendation rate for remediation. A deterministic mock analyst is included for reproducibility and comparison.

Key Findings

Direct overrides (S1) are ineffective: across all conditions gpt-4o-mini achieved 0% suppression for S1 classification attacks, showing that literal "ignore previous instructions" patterns no longer succeed against this model.
Persona hijack (S2) is highly effective for classification: under naive prompting S2 achieved a 68% suppression rate; structured prompting reduced this to 15%, but sanitisation and constrained output still left nontrivial residuals (about 22% and 33% respectively in reported conditions).
Context manipulation (S3) makes summarisation the riskiest task: S3 reached 96% injection success against naive defences and remained effective at 38% under constrained output. Successful injections often caused omission or softening of attack descriptions rather than an explicit benign label.
Obfuscated payloads (S4) failed: gpt-4o-mini did not decode or follow base64-style prefixed instructions and showed no meaningful compliance with S4 payloads.
Defences reduce but do not eliminate risk: average injection success across tasks fell from 26.6% under naive prompting to 17.3% with structured prompting, 13.2% with sanitisation and 11.8% with constrained output, leaving a residual attack surface. Remediation still showed unsafe recommendation rates up to 39% in some naive conditions and S2 remained around 20% under the strongest defence.
Simulation can mispredict real behaviour: the deterministic mock analyst overestimated S1 effectiveness (predicting 92% naive suppression versus 0% observed) and correlated poorly with the live model (per-task Pearson r values 0.22 for classification, 0.42 for summarisation, 0.47 for remediation).

Limitations

Evaluation uses a single model at one point in time and synthetic logs patterned on public datasets rather than production traces. The study does not cover multi-turn attacks, tool-using agents, or post-ingestion pipeline modifications. The mock analyst is a reproducibility tool and is not predictive of all LLMs.

Implications

An attacker who can send traffic to a reachable service can exploit log-substrate injection as a zero-cost channel to influence LLM-based analysts. Practical offensive effects include causing true malicious events to be labelled benign, producing summaries that omit or downplay intrusions, and generating remediation advice that recommends no action, all of which can delay containment, suppress escalation and mislead human responders. Persona-mimicry and structural-mimicry payloads are the most promising vectors for such manipulation, and simple input-side defences alone do not guarantee safety.

Links Original paper on arXiv

Attackers hijack SOC copilots via log prompt injection

How the attack works

The uncomfortable bit

📋 Original Paper Title and Abstract

Poisoning the Watchtower: Prompt Injection Attacks Against LLM-Augmented Security Operations Through Adversarial Log Content

🔍 ShortSpan Analysis of the Paper

Problem

Approach

Key Findings

Limitations

Implications

Related Articles

Survey maps LLM vulnerabilities across the enterprise stack

Secure AI Agents Need Dynamic Plans and Policies

Agents Leak Secrets via Web Search Tools

Related Research

Get the weekly digest