LLMs Aid SOC Analysts, But Do Not Replace Them
Enterprise
The new field study is refreshingly honest and quietly alarming. Researchers tracked 3,090 queries from 45 SOC analysts over 10 months and found LLMs mostly doing grunt sensemaking: decoding commands, clarifying logs, and tidying writeups. They help, but they do not replace the analyst in the loop.
That matters because vendors and the press love bold headlines, while the real story here is less glamorous. Analysts use short 1–3 turn exchanges to get context or reword a report, and only about 4 percent of interactions ask for explicit recommendations. In short, LLMs are a tool, not an oracle.
The good news is practical: these models reduce friction on tiny tasks that otherwise interrupt attention during a crisis. The worrying news is structural. Usage skews toward a handful of power users, the study covers a single enterprise deployment, and no hard outcomes like time-to-triage or missed alerts are reported. That leaves room for complacency and for leaders to assume benefits that are not yet proven.
Design takeaway: toolmakers must surface evidence, not rhetoric. Show the telemetry used for a suggestion, flag uncertainty, and avoid turning helpful drafts into unquestioned decisions.
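To make that takeaway concrete, here is a minimal sketch, in Python, of a suggestion structure that keeps the supporting evidence and the model's uncertainty attached to every draft. The field names and rendering are illustrative assumptions, not something the paper or any particular product specifies.

```python
from dataclasses import dataclass, field

@dataclass
class TriageSuggestion:
    # An LLM-assisted draft plus the evidence and caveats it rests on.
    # Field names are illustrative assumptions, not from the paper or any product.
    summary: str                                        # the model's draft interpretation
    evidence: list[str] = field(default_factory=list)   # telemetry lines or log excerpts cited
    confidence: str = "unverified"                       # e.g. "low", "medium", "high", "unverified"
    caveats: list[str] = field(default_factory=list)     # gaps the analyst must check before acting

def render(s: TriageSuggestion) -> str:
    # Always show the evidence and caveats next to the draft, never the draft alone.
    lines = [f"DRAFT ({s.confidence} confidence): {s.summary}", "Evidence:"]
    lines += [f"  - {e}" for e in s.evidence] or ["  - none provided"]
    if s.caveats:
        lines += ["Check before acting:"] + [f"  - {c}" for c in s.caveats]
    return "\n".join(lines)

print(render(TriageSuggestion(
    summary="Encoded PowerShell appears to download a payload from an external host.",
    evidence=["EDR event 4104: powershell.exe -enc JAB...",
              "Proxy log: GET http://198.51.100.7/a.ps1"],
    confidence="medium",
    caveats=["Decoded command not yet confirmed against the raw event"],
)))
```

The point of the design is that an analyst never sees a bare answer: the telemetry behind the draft and its unverified gaps travel with the text.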
Do this now: require any SOC LLM trial to log queries and outputs for audit, and run a short pilot that measures time-to-triage and error rates before you scale. Small, measurable proofs beat optimism and marketing every time.
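As a starting point for that audit trail, the sketch below wraps an LLM call with append-only JSONL logging. It assumes a hypothetical llm_call function standing in for whatever interface your deployment exposes; the file path and field names are placeholders, not a prescribed schema.

```python
import json
import time
import uuid
from pathlib import Path

AUDIT_LOG = Path("llm_audit.jsonl")  # assumed location; in practice, ship records to your SIEM

def audited_query(analyst_id: str, prompt: str, llm_call) -> str:
    # llm_call is whatever callable your deployment exposes (hypothetical here):
    # it takes a prompt string and returns a response string.
    started = time.time()
    response = llm_call(prompt)
    record = {
        "id": str(uuid.uuid4()),
        "ts": started,
        "latency_s": round(time.time() - started, 3),
        "analyst": analyst_id,
        "prompt": prompt,
        "response": response,
    }
    # Append-only JSONL keeps every query and output reviewable after the pilot.
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return response
```

Pair these records with per-alert triage timestamps and you have the raw material to compare time-to-triage and error rates during the pilot.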
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies how large language models (LLMs) are actually used by Security Operations Centre (SOC) analysts in live operations, addressing a gap between laboratory evaluations and real-world practice. This matters because LLMs could change analyst workload, decision processes and tooling in time-sensitive, high-stakes security environments.
Approach
The researchers conducted a longitudinal analysis of 3,090 valid analyst queries submitted by 45 SOC analysts over 10 months to an internally deployed GPT-4 instance (GPT-4-0613) with no fine-tuning and no internet access. They anonymised the data and applied a six-phase mixed-methods process: exploratory statistics, manual coding and conversation tagging, semantic clustering with Sentence-BERT, and triangulation. Reported inter-rater reliability was Fleiss' Kappa of 0.90, 0.82 and 0.79 for the coding dimensions and 0.75 for conversation tagging. Analysts were free to use the system but instructed not to share sensitive data.
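The paper names Sentence-BERT for the semantic-clustering phase but does not specify the checkpoint or clustering algorithm; the sketch below assumes the sentence-transformers library, the all-MiniLM-L6-v2 model and k-means purely to illustrate the kind of pipeline involved.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Placeholder queries standing in for the 3,090 anonymised analyst queries.
queries = [
    "what does this encoded powershell command do",
    "rewrite this incident summary for a management audience",
    "explain what this regex matches in the proxy logs",
    "is this base64 blob malicious",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint, not stated in the paper
embeddings = model.encode(queries, normalize_embeddings=True)

# Cluster count is arbitrary here; the study derived its themes from the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

for label, query in sorted(zip(labels, queries)):
    print(label, query)
```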
Key Findings
- Adoption: usage rose from under 10 to over 30 queries per day, driven by a subset of analysts; one power user submitted ~600 queries (17% of the dataset).
- Primary uses: 31% of queries were command interpretation, 22% text editing/rewriting, 11% code/script/regex analysis; 93% of queries aligned with NICE cybersecurity competencies.
- Interaction style: most conversations were short and iterative, with 57% at two steps and 75% at two to three queries; analysts favoured 1–3 turn exchanges and retained final decision authority.
- Behavioural patterns: analysts used the LLM as an on-demand cognitive aid for sensemaking and communication rather than for prescriptive judgements; only ~4% requested explicit recommendations.
- Query/response stats: mean analyst query length 25 words, mean LLM response 161 words; median gap between visits 1–2 hours.
Limitations
A single enterprise SOC and a single model deployment limit generalisability; no objective performance metrics (such as time-to-triage or accuracy) were measured; and potential novelty effects, plus the absence of direct interviews, make analyst motives hard to validate.
Why It Matters
LLMs can augment SOC workflows by interpreting low-level telemetry and offloading documentation tasks, improving situational awareness and efficiency while preserving analyst autonomy. Design priorities include embedding context-aware explanations, supporting microtasks to reduce context switching, and surfacing evidence rather than definitive recommendations to limit over-reliance. Further multi-site, outcome-focused studies are needed.