Attackers hijack SOC copilots via log prompt injection
Agents
Security teams keep wiring Large Language Models (LLMs) into their pipelines and then act surprised when the model believes what an attacker writes. This paper nails the core issue: the log fields you feed a copilot are often attacker controlled. User agents, URLs, payloads, DNS queries, even attempted usernames are not context; they are an input channel. Treat them otherwise and you hand the adversary a steering wheel.
How the attack works
The authors test four log-substrate prompt injection styles. Direct overrides (the clichéd “ignore all instructions”) flop completely here: 0% suppression in classification across conditions with gpt-4o-mini. That’s not the win defenders think it is. Persona hijacks succeed where overrides fail: under naive prompting they suppress 68% of malicious logs, and they still bite through stronger defences, leaving roughly a fifth of bad events suppressed even with the tightest setup reported. Context manipulation is the real car crash. In summarisation tasks, injection success hits 96% without defences and still lands at 38% with constrained output. That means the model rewrites the story of the incident to omit or soften the attack. Remediation is not immune either, with unsafe advice appearing up to 39% of the time in some naive conditions. Obfuscated payloads (think base64-style prefixed instructions) get nowhere in this study, which is the only comfort on offer.
Defences help, but not enough. Average injection success across tasks drops from 26.6% with naive prompting to 11.8% under the strongest defence. That residual is not academic when your pager is on fire. And the riskiest surface is not the classification label everyone obsesses over; it’s the summary that shapes the human responder’s next move.
The uncomfortable bit
Most SOCs treat logs as ground truth and the LLM as a clever reader. This flips that assumption. The attacker gets to write in the margin, and the model is credulous. The study also punctures a comfortable crutch: simulation. A deterministic mock analyst predicted direct overrides would work (92% suppression, naive) when the real model delivered 0%. Correlation between the mock and the live model was weak across tasks. If your risk assessment rests on a toy analyst, you are measuring the toy.
My view: plugging an LLM straight into SIEM logs without isolating untrusted fields is operational negligence. The numbers here are not edge cases; they are what happens when you let adversaries speak inside your tooling. Until models reliably separate “instructions” from “evidence”, treat SOC copilots as bright interns who can be socially engineered by a log line. If that sounds intolerable for your workflow, that’s the point.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Poisoning the Watchtower: Prompt Injection Attacks Against LLM-Augmented Security Operations Through Adversarial Log Content
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies a structural vulnerability in security operation centre pipelines that use large language models as analyst assistants. Many log fields are attacker controlled (user agents, URLs, payloads, DNS queries, usernames) and can therefore carry instructions to the model alongside evidence of an intrusion. That setting, called log-substrate prompt injection, risks corrupting triage labels, incident summaries or remediation advice and so can materially delay or prevent containment.
Approach
The authors define four attack classes (S1 direct override, S2 persona hijack, S3 context manipulation, S4 obfuscated payload) and evaluate them against an LLM analyst (gpt-4o-mini) across three tasks (classification, summarisation, remediation) and four simple defence modes (naive, structured prompting, input sanitisation, constrained output). They synthesize logs using field and scenario schemas based on CIC-IDS2017 and UNSW-NB15, generate 200 examples per condition with a 60/40 malicious/benign mix, inject payloads into attacker-controlled fields, and score outcomes using task-appropriate metrics: suppression rate for classification, injection success rate for summarisation, and unsafe recommendation rate for remediation. A deterministic mock analyst is included for reproducibility and comparison.
Key Findings
- Direct overrides (S1) are ineffective: across all conditions gpt-4o-mini achieved 0% suppression for S1 classification attacks, showing that literal "ignore previous instructions" patterns no longer succeed against this model.
- Persona hijack (S2) is highly effective for classification: under naive prompting S2 achieved a 68% suppression rate; structured prompting reduced this to 15%, but sanitisation and constrained output still left nontrivial residuals (about 22% and 33% respectively in reported conditions).
- Context manipulation (S3) makes summarisation the riskiest task: S3 reached 96% injection success against naive defences and remained effective at 38% under constrained output. Successful injections often caused omission or softening of attack descriptions rather than an explicit benign label.
- Obfuscated payloads (S4) failed: gpt-4o-mini did not decode or follow base64-style prefixed instructions and showed no meaningful compliance with S4 payloads.
- Defences reduce but do not eliminate risk: average injection success across tasks fell from 26.6% under naive prompting to 17.3% with structured prompting, 13.2% with sanitisation and 11.8% with constrained output, leaving a residual attack surface. Remediation still showed unsafe recommendation rates up to 39% in some naive conditions and S2 remained around 20% under the strongest defence.
- Simulation can mispredict real behaviour: the deterministic mock analyst overestimated S1 effectiveness (predicting 92% naive suppression versus 0% observed) and correlated poorly with the live model (per-task Pearson r values 0.22 for classification, 0.42 for summarisation, 0.47 for remediation).
Limitations
Evaluation uses a single model at one point in time and synthetic logs patterned on public datasets rather than production traces. The study does not cover multi-turn attacks, tool-using agents, or post-ingestion pipeline modifications. The mock analyst is a reproducibility tool and is not predictive of all LLMs.
Implications
An attacker who can send traffic to a reachable service can exploit log-substrate injection as a zero-cost channel to influence LLM-based analysts. Practical offensive effects include causing true malicious events to be labelled benign, producing summaries that omit or downplay intrusions, and generating remediation advice that recommends no action, all of which can delay containment, suppress escalation and mislead human responders. Persona-mimicry and structural-mimicry payloads are the most promising vectors for such manipulation, and simple input-side defences alone do not guarantee safety.