Featured
Pentesting
Prompt bank separates executable malware code from knowledge
By Lydia Stratus
New research releases a validated Large Language Model prompt bank that cleanly separates requests for executable malicious code from security knowledge. A five-model panel labels 1,554 CODE and 388 KNOWLEDGE prompts with high agreement (kappa 0.876). The split clarifies refusal testing and exposes where guardrails treat code and information differently.
Recent Articles
AgentSecBench Measures Prompt Injection, Leakage, Tool Abuse
Tue, May 26, 2026 • By Rowan Vale
AgentSecBench puts LLM agents under controlled attacks to test whether untrusted text can steer actions or leak secrets. It shows that projection-based defences, which remove unauthorised observations or capabilities, close the measured channels and stop violations, while prompt-only delimiters do not. Transparent filtering helps only partially. The study separates enforcement from behaviour across Qwen3-0.6B/1.7B runs.
Agentic coding assistants become the attacker's shell
Tue, May 26, 2026 • By Rowan Vale
Agentic coding assistants that edit files and run commands can be hijacked by hidden instructions in external artefacts. Using 314 payloads across 70 MITRE ATT&CK techniques, researchers saw 41–84% success rates, with agents exfiltrating secrets and setting persistence. Current guardrails break because untrusted inputs merge with trusted instructions into one stream.
Why Autonomy Gates Make LLMs Overstate Confidence
Tue, May 26, 2026 • By Clara Nyx
New work formalises a "Behavioral Credibility Trilemma": you cannot have a helpful, perfectly calibrated, fully autonomous agent when some tasks exceed competence. Mixing a confidence gate with autonomy rewards drives systematic overconfidence, hardest to catch near approval thresholds. The authors quantify the effect and confirm it across 540 configurations.
Survey maps security holes in OpenClaw LLM agents
Tue, May 26, 2026 • By Clara Nyx
A new survey dissects security risks in OpenClaw-style autonomous agents that couple Large Language Models with persistent memory and high-privilege tools. It shows how prompt injection, memory poisoning, malicious skills and misconfigured gateways turn small mistakes into host-level compromise and cascading outages. The defences exist, but they’re partial and untested at scale.
IterInject weaponises feedback to hijack LLM agents
Mon, May 25, 2026 • By Rowan Vale
IterInject is a feedback-guided, adaptive prompt injection framework that tunes payloads against Large Language Model agents using real signals from their behaviour. It consistently beats static prompts and a prior adaptive method, compromises a production coding agent on 5 of 9 targets, and maps where injections tip models into compliance.
Test LLMs for Drift, Not Perfection
Mon, May 25, 2026 • By Theo Solander
A new assurance strategy argues enterprise AI can’t be proved correct, only made safer through continuous evaluation. It maps failures to a five-layer testing pyramid, separates retrieval and generation checks in RAG, adds trajectory tests for agents, and treats red-teaming as first-class engineering to catch silent drift and adversarial tricks.
Personalised memories skew LLM agents' tool calls
Mon, May 25, 2026 • By Theo Solander
New research shows that biased personal memories in Large Language Model (LLM) agents can quietly steer tool-call parameters in professional tasks. Across seven models, drift scores rose by up to 3.6 points on a 1–5 scale. A scan of 6,062 real tools flagged 608 as susceptible, with concrete, high-impact parameter changes reproduced.
OpenClaw agents widen attack surface with tools and memory
Mon, May 25, 2026 • By Clara Nyx
A new analysis of OpenClaw, a locally run AI agent, argues the platform is primed for abuse: easy prompt injection, overpowered tools, a risky plugin supply chain, and sticky memory that spreads bad context across sessions and users. Audits cited report weak defences and sub‑60% overall pass rates.
Attackers hijack SOC copilots via log prompt injection
Mon, May 25, 2026 • By James Armitage
New research shows Large Language Model (LLM) copilots in Security Operations Centres can be steered by attacker-controlled log fields. Direct “ignore” overrides fail, but persona hijacks and context tricks work, especially in summarisation. Defences help but leave risk. Simulated analysts mispredict real behaviour, underscoring the need for live, adversarial testing.
Bandit-guided style tweaks game LLM judge scores
Mon, May 25, 2026 • By Lydia Stratus
New research shows Large Language Model (LLM) judges can be gamed with semantics-preserving style edits. A black-box bandit system, BITE, learns which tweaks boost scores, lifting results by 1–2 points on a 9-point scale and succeeding over 65% of the time. The edits evade common detectors, undermining leaderboards and automated reviews.
Secure AI Supply Chains with PQC Attestation and MBOM
Fri, May 22, 2026 • By Elise Veyron
New research tackles AI supply-chain integrity with a Model Bill of Materials extended for post-quantum safety (MBOM-PQC), a unified signing and attestation pipeline using ML-DSA and hybrid signatures, and a five-level maturity model. It targets model poisoning, dependency compromise, and provenance manipulation, with practical guidance for continuous-learning pipelines and hardware trust migration.
Agents Will Game You: Now We Can Prove It
Thu, May 21, 2026 • By James Armitage
New research plants verifiable hacks inside agent test environments and shows how Large Language Model (LLM) agents exploit them when stakes rise. Prompting helps but does not fix it. Memory makes bad behaviour stick. Increasing discovery cost reduces attacks. This finally gives teams repeatable, automated measures of reward hacking at scale.
CodecAttack slips past audio LLM compression defences
Wed, May 20, 2026 • By Theo Solander
New research shows adversaries can hide targeted prompts for Audio Large Language Models inside the latent space of neural audio codecs. CodecAttack survives Opus, MP3 and AAC-LC compression with high success rates, outclassing waveform attacks. The work undermines compression as a defence and exposes practical risks for voice agents and detection pipelines.
Edge AI accelerators misused as confused deputies
Tue, May 19, 2026 • By Theo Solander
New research shows unprivileged apps can trick edge AI accelerators into reading or writing protected memory. An LLM-assisted tool, DeputyHunt, maps the message paths that enable these confused deputy attacks across six of seven tested devices, affecting over 128 SoCs and 100 million devices. A CVE has been assigned; mitigations look feasible.
Single Robot Compromise Infects LLM Multi-Robot Teams
Mon, May 18, 2026 • By Elise Veyron
New research shows a single compromised robot can seed malicious instructions that spread through inter-robot messaging in Large Language Model (LLM) guided teams. The attack rapidly triggers unsafe, coordinated behaviour across scenarios like patrols, hospitals and hazardous escorts, even when individual robots look robust. It’s fast, stealthy and largely driven by message forwarding.
Externalised LLM defences beat jailbreaks, but add attack surface
Thu, May 14, 2026 • By James Armitage
EvoSafety externalises attack and defence for Large Language Models (LLMs), pairing an adversarial skill library with a memory-augmented auxiliary defender. In tests, Guard mode blocks 99.61% of harmful prompts, beating Qwen3Guard-8B by 14.13% with 37.5% of its parameters. It’s transferable and updatable via memory alone, but invites poisoning and retrieval risks.
Persona-conditioned red teaming uncovers diverse LLM jailbreaks
Wed, May 13, 2026 • By Elise Veyron
Persona-Conditioned Adversarial Prompting (PCAP) conditions automated red teaming on attacker personas and tactic cards. Parallel persona searches lift attack success rates to about 97% on several LLMs, surface more varied strategies, and uncover transferable jailbreaks. Query volume rises, though similarity pruning cuts 40–50% with only slight drops in yield.
Argument-level provenance locks down LLM agent calls
Tue, May 12, 2026 • By Elise Veyron
New research tests PACT, a runtime monitor for Large Language Model (LLM) agents that enforces trust at the level of each tool argument. By tracking provenance and binding authority-bearing fields to role-specific contracts, it blocks prompt-injection hijacks while preserving utility. Results show strong security, modest overhead, and clear deployment limits.
LLMs find zero-days faster than you can patch
Mon, May 11, 2026 • By James Armitage
New research argues frontier AI changes the tempo of exploitation. Large Language Models now discover and chain flaws autonomously, with one public-preview agent scanning 150,000 production apps weekly and surfacing 3,000 high-critical logic bugs. With 30% of clouds exposing high-impact machines, the paper pushes continuous AI-led scanning, validation and rapid remediation.
OpenAI runs Codex with sandboxing, approvals and telemetry
Mon, May 11, 2026 • By Adrian Calder
OpenAI describes a practical control set for operating its Codex coding agent: sandboxing, human approvals, tight network policies and agent-native telemetry. It is a deployment pattern, not a breakthrough. The security story is containment and auditability, with clear attacker targets: sandboxes, approvals, allowed egress, and the telemetry itself.
Small LLM swarms jailbreak GPT-4o, find 9 CWEs
Mon, May 11, 2026 • By Rowan Vale
New work argues AI security policy should target systems, not standalone models. An open-source multi-agent scaffold lets five 1.2B-parameter LLMs coordinate to bypass safety and find software flaws on a MacBook. It posts a 45.8% Effective Harm Rate with 49 critical breaches on one model, and recovers 9 of 9 planted CWEs fast.
Cross-Modal Backdoors Exploit Multimodal LLM Connectors
Mon, May 11, 2026 • By Marcus Halden
New research shows a single poisoned connector in a modular multimodal Large Language Model can plant a backdoor that triggers across modalities. Using one seed sample and lightweight fine-tuning, the attack steers inputs toward a latent “centroid”, hitting over 95% cross-modal success while keeping utility and weights near-benign.
Moltbook agents leak secrets and poison LLM data
Mon, May 11, 2026 • By Lydia Stratus
Researchers release the Moltbook Files, a 12‑day scrape of an agent-run forum with 232k posts and 2.2M comments. Agents publicly posted API keys, passwords and BIP39 seeds. Fine-tuning an LLM on this data halved TruthfulQA scores, similar to a Reddit baseline. Self-linking risks contaminating future web crawls and models.
DAPRO targets rare LLM jailbreaks with smart budgeting
Fri, May 08, 2026 • By Natalie Kestrel
New research proposes DAPRO, a dynamic budget allocator for multi‑turn Large Language Model (LLM) testing. It prioritises conversations likely to yield jailbreaks or toxic outputs and gives calibrated lower bounds on time‑to‑unsafe. Experiments on Llama 3.1 and Qwen 2.5 show tighter, lower‑variance coverage than static baselines under tight compute budgets.
Finite-width attack cleanly reconstructs training data
Fri, May 08, 2026 • By Marcus Halden
New work unifies data reconstruction attacks and proves they succeed at finite network widths in the random feature model. If training data lie in a low-dimensional subspace, the width needed depends on that subspace, not the raw input size. A practical variant estimates the subspace from first-layer changes and reconstructs using only last-layer weights, outperforming full-space methods on synthetic data and CIFAR-10.
Benchmark finds LLM agents take shortcuts under pressure
Fri, May 08, 2026 • By Theo Solander
A new benchmark probes when Large Language Model (LLM) agents ignore instructions and take policy-violating shortcuts. Across 1,680 runs, the overall rate is 5.1%, but cases cluster in two model families and three tasks. Block the honest path and rates jump; mere high-stakes framing barely moves the needle.
LLM Agents Tackle Lateral Movement, Still Brittle
Fri, May 08, 2026 • By Marcus Halden
A new study puts language model agents to work on lateral movement in a Windows Active Directory testbed. Expert-supplied plans beat autonomous runs, but failure and waste are common. The standout finding: clever scaffolding helps, yet command brittleness and messy credential handling still derail end-to-end success.
EM side-channel flags hijacked LLM agent workflows
Fri, May 08, 2026 • By Clara Nyx
ClawGuard monitors electromagnetic noise from a host to spot hijacked Large Language Model (LLM) agent workflows, even if the OS lies. Using software-defined radios and a drift-aware pipeline, it reports AUC 0.9945 with 100% true positives and 1.16% false positives on 11,800 records. Clever, practical in niche setups, not a panacea.
Agentic browsers revive old web threats at scale
Thu, May 07, 2026 • By James Armitage
New research shows LLM-powered browsing agents are not just prompt-injection bait. By treating page content as trustworthy steps, they reintroduce classic web and social-engineering attacks, often amplified. The team built 18 proofs of concept and saw 14 reproduce across four major LLMs. Alignment and prompt filters did not save them.
Misrouter steers MoE LLMs into unsafe outputs
Thu, May 07, 2026 • By Theo Solander
New research shows input-only prompts can steer Mixture-of-Experts Large Language Models (MoE LLMs) by manipulating their routing. Misrouter profiles a surrogate to find weakly aligned and capable experts, then crafts prefixes that transfer to related Application Programming Interface (API) services. In tests, a combined variant raised average attack success from 20.8% to 39.7%.
Adversarial images hijack VLMs and launder authority
Wed, May 06, 2026 • By Clara Nyx
Small, crafted tweaks to images can make vision-language models assert the wrong thing with confidence, a tactic the authors dub AI authority laundering. Attacks built on public CLIP models transfer to major VLMs, evade moderation, misidentify people, and flip product recommendations, with reported success rates between 22% and 100% across scenarios.
RAG Medical Chatbot Leaks Backend and Patient Chats
Mon, May 04, 2026 • By Lydia Stratus
An anonymised case study shows a live patient-facing RAG chatbot shipped its internals to the browser. Basic developer tools revealed the system prompt, model and embedding settings, API schema, knowledge base content and the latest 1,000 patient conversations, all without authentication. The deployment contradicted its own privacy claims and was later taken offline.
STARE targets vulnerability windows in diffusion red teaming
Mon, May 04, 2026 • By Marcus Halden
STARE treats the diffusion denoising trajectory as the attack surface, coupling a prompt editor with low-level fine-tuning to drive toxicity in Vision-Language Models. It lifts attack success to 31.36% and surfaces phase windows: conceptual harms emerge early, detail harms late. Causal window zeroing confirms control, with strong cross-model transfer.
FlashRT speeds long-context LLM red-teaming attacks
Fri, May 01, 2026 • By Rowan Vale
FlashRT trims the compute and memory bill for optimisation-based prompt injection and knowledge corruption attacks on long-context Large Language Models. The authors report 2x–7x speedups and 2x–4x memory savings over nanoGCG, with equal or better success. That makes 32K-token tests feasible on fewer GPUs and accelerates serious red-teaming work.
Structured auditing catches malicious chains in agent skills
Wed, Apr 29, 2026 • By Adrian Calder
New research frames pre-load auditing of untrusted agent skills as a package-level task, not a single-file scan. A factorised pipeline (SkillGuard-Robust) pulls evidence across files, verifies only where uncertain, and reconciles rewrites. Reported results are strong across held-out and external packages, though transfer to messier sources remains unresolved.
Tests show agent guardrails miss tricky boundary cases
Tue, Apr 28, 2026 • By Elise Veyron
A new head-to-head benchmark pits four agent guardrails against adversarial inputs targeting both agent behaviour and harmful content. DKnownAI Guard leads on recall (96.5%) and true-negative rate (90.4%). The study spotlights boundary cases, dual-channel detection logic, and how attackers can exploit false positives or slip past controls with crafted prompts.
