Featured
PentestingSTARE targets vulnerability windows in diffusion red teaming
By Marcus Halden
STARE treats the diffusion denoising trajectory as the attack surface, coupling a prompt editor with low-level fine-tuning to drive toxicity in Vision-Language Models. It lifts attack success to 31.36% and surfaces phase windows: conceptual harms emerge early, detail harms late. Causal window zeroing confirms control, with strong cross-model transfer.
Recent Articles
LLMs find zero-days faster than you can patch
Mon, May 11, 2026 • By James Armitage
New research argues frontier AI changes the tempo of exploitation. Large Language Models now discover and chain flaws autonomously, with one public-preview agent scanning 150,000 production apps weekly and surfacing 3,000 high-critical logic bugs. With 30% of clouds exposing high-impact machines, the paper pushes continuous AI-led scanning, validation and rapid remediation.
Cross-Modal Backdoors Exploit Multimodal LLM Connectors
Mon, May 11, 2026 • By Marcus Halden
New research shows a single poisoned connector in a modular multimodal Large Language Model can plant a backdoor that triggers across modalities. Using one seed sample and lightweight fine-tuning, the attack steers inputs toward a latent “centroid”, hitting over 95% cross-modal success while keeping utility and weights near-benign.
Moltbook agents leak secrets and poison LLM data
Mon, May 11, 2026 • By Lydia Stratus
Researchers release the Moltbook Files, a 12‑day scrape of an agent-run forum with 232k posts and 2.2M comments. Agents publicly posted API keys, passwords and BIP39 seeds. Fine-tuning an LLM on this data halved TruthfulQA scores, similar to a Reddit baseline. Self-linking risks contaminating future web crawls and models.
LLM Agents Tackle Lateral Movement, Still Brittle
Fri, May 08, 2026 • By Marcus Halden
A new study puts language model agents to work on lateral movement in a Windows Active Directory testbed. Expert-supplied plans beat autonomous runs, but failure and waste are common. The standout finding: clever scaffolding helps, yet command brittleness and messy credential handling still derail end-to-end success.
EM side-channel flags hijacked LLM agent workflows
Fri, May 08, 2026 • By Clara Nyx
ClawGuard monitors electromagnetic noise from a host to spot hijacked Large Language Model (LLM) agent workflows, even if the OS lies. Using software-defined radios and a drift-aware pipeline, it reports AUC 0.9945 with 100% true positives and 1.16% false positives on 11,800 records. Clever, practical in niche setups, not a panacea.
Agentic browsers revive old web threats at scale
Thu, May 07, 2026 • By James Armitage
New research shows LLM-powered browsing agents are not just prompt-injection bait. By treating page content as trustworthy steps, they reintroduce classic web and social-engineering attacks, often amplified. The team built 18 proofs of concept and saw 14 reproduce across four major LLMs. Alignment and prompt filters did not save them.
Misrouter steers MoE LLMs into unsafe outputs
Thu, May 07, 2026 • By Theo Solander
New research shows input-only prompts can steer Mixture-of-Experts Large Language Models (MoE LLMs) by manipulating their routing. Misrouter profiles a surrogate to find weakly aligned and capable experts, then crafts prefixes that transfer to related Application Programming Interface (API) services. In tests, a combined variant raised average attack success from 20.8% to 39.7%.
Adversarial images hijack VLMs and launder authority
Wed, May 06, 2026 • By Clara Nyx
Small, crafted tweaks to images can make vision-language models assert the wrong thing with confidence, a tactic the authors dub AI authority laundering. Attacks built on public CLIP models transfer to major VLMs, evade moderation, misidentify people, and flip product recommendations, with reported success rates between 22% and 100% across scenarios.
Prompt bank separates executable malware code from knowledge
Tue, May 05, 2026 • By Lydia Stratus
New research releases a validated Large Language Model prompt bank that cleanly separates requests for executable malicious code from security knowledge. A five-model panel labels 1,554 CODE and 388 KNOWLEDGE prompts with high agreement (kappa 0.876). The split clarifies refusal testing and exposes where guardrails treat code and information differently.
RAG Medical Chatbot Leaks Backend and Patient Chats
Mon, May 04, 2026 • By Lydia Stratus
An anonymised case study shows a live patient-facing RAG chatbot shipped its internals to the browser. Basic developer tools revealed the system prompt, model and embedding settings, API schema, knowledge base content and the latest 1,000 patient conversations, all without authentication. The deployment contradicted its own privacy claims and was later taken offline.
FlashRT speeds long-context LLM red-teaming attacks
Fri, May 01, 2026 • By Rowan Vale
FlashRT trims the compute and memory bill for optimisation-based prompt injection and knowledge corruption attacks on long-context Large Language Models. The authors report 2x–7x speed and 2x–4x memory gains over nanoGCG, with equal or better success. That makes 32K-token tests feasible on fewer GPUs and accelerates serious red-teaming work.
Structured auditing catches malicious chains in agent skills
Wed, Apr 29, 2026 • By Adrian Calder
New research frames pre-load auditing of untrusted agent skills as a package-level task, not a single-file scan. A factorised pipeline (SkillGuard-Robust) pulls evidence across files, verifies only where uncertain, and reconciles rewrites. Reported results are strong across held-out and external packages, though transfer to messier sources remains unresolved.
Tests show agent guardrails miss tricky boundary cases
Tue, Apr 28, 2026 • By Elise Veyron
A new head-to-head benchmark pits four agent guardrails against adversarial inputs targeting both agent behaviour and harmful content. DKnownAI Guard leads on recall (96.5%) and true negatives (90.4%). The study spotlights boundary cases, dual-channel detection logic, and how attackers can exploit false positives or slip past controls with crafted prompts.
AI image models erode trust in visual evidence
Tue, Apr 28, 2026 • By Theo Solander
New research tracks how frontier image models move from art to synthetic visual evidence. Risk comes from realism combined with readable text, identity persistence, fast iteration and distribution context. Incidents already touch markets, medicine, law and phishing. Provenance helps but breaks under screenshots, so layered controls and swift response matter.
LLM Defenders Expose Limits of Static Cyber Ranges
Tue, Apr 28, 2026 • By James Armitage
New research shows Large Language Model (LLM) defenders change cyber‑range testing. Static ranges let an LLM APT succeed 41–100% across scenarios. Add real‑time AI defenders and success drops to 0–55%, often to zero. A small on‑prem model matched a frontier cloud model and detected intrusions 10x faster in one enterprise case.
Semantic visor blocks LLM agent prompt injections
Tue, Apr 28, 2026 • By Natalie Kestrel
AgentVisor treats the task agent as untrusted and audits every tool call using Suitability, Taint and Integrity checks. It reports a 0.65% attack success rate with only 1.45% utility loss, plus a one-shot self-correction that keeps tasks moving. Strong numbers, but the visor’s blind spots hint at new red-team angles.
Multi-agent LLM builds digital twins to test exploits
Mon, Apr 27, 2026 • By Lydia Stratus
A new multi-agent framework pairs LLM-driven reconnaissance with a digital twin of the target to stabilise exploits before live use. It exfiltrates executables, syncs runtime state, and iterates crashes in isolation, then executes a one-shot payload. Tests across eight scenarios report strong efficiency and reduced live-fire risk, with clear operational limits.
Black-Box Skill Theft Exposes LLM Agent Market
Fri, Apr 24, 2026 • By Natalie Kestrel
New research shows attackers can extract proprietary skills from Large Language Model (LLM) agents in a few queries. An automated prompt-generation pipeline pulls high-fidelity copies of hidden instructions, with baseline prompts already leaking heavily. Defences help but leave semantic traces, and repeated low-cost attempts make one-off compromise likely.
GPT-5.5 bug bounty hunts bio-safety jailbreaks
Fri, Apr 24, 2026 • By Rowan Vale
A public GPT-5.5 Bio Bug Bounty invites red teams to find universal prompt jailbreaks that bypass bio-safety controls, paying up to $25,000. It positions bug bounties as a practical shakedown of safety filters and a governance input, but the announcement stays light on methods, examples, or measured effectiveness.
Function hijacking makes LLM agents pick attacker tools
Thu, Apr 23, 2026 • By Adrian Calder
New research shows an attacker can poison a tool's description to make Large Language Model (LLM) agents call that tool regardless of the prompt. Success rates hit 70–100% across several models, with universal variants working across queries. It shifts risk from prompts to tool metadata, with clear implications for agent plug-in ecosystems.
Auto-synthesised agent harnesses find Chrome zero-days
Thu, Apr 23, 2026 • By Adrian Calder
AgentFlow automates how multiple Large Language Model (LLM) agents are wired together for vulnerability discovery. By searching harness design with runtime feedback, it boosts benchmark scores and, in a Chrome case study, found ten new zero-days including two Critical sandbox escapes. Big gains, but only with source access and instrumented builds.
Z3 engine spots integer flaws in AI sandboxes
Thu, Apr 23, 2026 • By Natalie Kestrel
New research puts the focus on the cage, not the model. COBALT uses the Z3 solver to flag integer overflows, underflows and truncations in C/C++ sandbox infrastructure before deployment, with confirmed findings in major codebases. Claims around the Mythos escape remain conditional, but the exploit class looks very real.
RLVR jailbreaks fool safety checks while SFT drifts
Tue, Apr 21, 2026 • By James Armitage
New research shows three LLM jailbreak routes reach similar harmful compliance but break models in different ways. RLVR keeps capability and harm recognition intact while still complying. A short reflective prompt sharply reduces its harm. SFT causes broad drift and capability loss. Abliteration is local and repairable. One-size fixes will miss real risk.
Cross-Domain Defences Lift LLM Prompt Injection Detection
Tue, Apr 21, 2026 • By Theo Solander
New research moves beyond regex and fine-tuned classifiers to detect LLM prompt injection. It ports techniques from linguistics, bioinformatics and materials science into prompt-shield v0.4.1 (Apache 2.0). Results include an F1 jump from 0.033 to 0.378 on deepset and an 11.1-point gain on indirect attacks, with noted false-positive trade-offs.
ASMR-Bench exposes LLM blind spots in research audits
Mon, Apr 20, 2026 • By James Armitage
ASMR-Bench tests whether auditors can spot subtle sabotage in machine learning research code. Across nine codebases, both frontier Large Language Models (LLMs) and LLM-assisted humans struggled. Design-level tweaks to data, hyperparameters or evaluation often slipped through. This is a research supply-chain integrity problem, not a one-model fix.
GuardPhish shows LLMs spot phishing yet still write it
Mon, Apr 20, 2026 • By Natalie Kestrel
GuardPhish probes open-source Large Language Models (LLMs) run offline and finds a wide enforcement gap: models that flag phishing prompts still produce usable attacks. Voice scripts fare worst, with attack success up to 98.5%. The authors propose pre-generation classifiers hitting 98.27% accuracy to curb abuse without touching the base model.
Jailbreak Seeds Harmful Reasoning While Answers Stay Clean
Mon, Apr 20, 2026 • By Lydia Stratus
New research shows a jailbreak that injects harmful content into a model’s chain-of-thought while keeping the final answer unchanged. Using semantic triggers and psychology-based framing, it hits an average 83.6% success rate against several commercial Large Reasoning Models, exposing a blind spot for systems that surface or log reasoning traces.
Verifiable Gradient Inversion Breaks Federated Tabular Privacy
Fri, Apr 17, 2026 • By Lydia Stratus
A new verifiable gradient inversion attack (VGIA) shows a malicious server in federated learning can provably isolate and reconstruct exact client records, including targets, from aggregated gradients. It works fast on tabular data and avoids guesswork by certifying when a single record has been isolated. Prior geometric attacks look clumsy by comparison.
OpenAI adds sandbox and harness to Agents SDK
Thu, Apr 16, 2026 • By James Armitage
OpenAI’s Agents SDK now ships with native sandbox execution and a model‑native harness. It puts an execution boundary and policy controls closer to the Large Language Model runtime. That should lift isolation, observability and governance for long‑running agents, but it also creates fresh targets and integration risks that attackers will probe.
Claude Code Agent Design Exposes Real Attack Paths
Wed, Apr 15, 2026 • By Lydia Stratus
New analysis tears into Claude Code and OpenClaw agent architectures. The core agent loop is simple; the risk lives in permissions, classifiers, plugins, MCP servers and subagents. A patched initialisation gap shows timing bugs are real. The paper maps concrete paths for command execution abuse, stealthy subagent work, and audit gaps.
Memory-led agents jailbreak VLMs via natural images
Wed, Apr 15, 2026 • By Theo Solander
MemJack shows how memory-augmented, multi-agent systems can jailbreak Vision-Language Models using the semantics of ordinary, unmodified images. It logs a 71.48% success rate against Qwen3-VL-Plus on COCO val2017 within 20 rounds, reaching 90% at 100. Attacks are query-efficient, transfer across datasets and models, and rely heavily on persistent memory.
BadSkill poisons agent skills with embedded backdoors
Mon, Apr 13, 2026 • By Marcus Halden
New research shows agent “skills” can hide backdoors inside bundled models. BadSkill fine-tunes a small classifier to fire only when ordinary-looking parameters line up, routing calls to a hidden payload. It reaches up to 99.5% attack success across eight triggered skills with minimal poison (3% gives 91.7%), while preserving benign behaviour.
Mosaic breaks closed VLMs with multi-view ensembles
Mon, Apr 13, 2026 • By Theo Solander
New research shows a multimodal jailbreak, Mosaic, that reliably transfers to closed-source vision–language models. By blending text tweaks, multi-view image perturbations and guidance from multiple surrogates, it cuts overfitting and raises attack success on GPT-4o, Gemini 3.0 and Claude 4.5, while simple image preprocessing only blunts, not blocks, the technique.
FinSec catches delayed risks in LLM finance agents
Mon, Apr 13, 2026 • By Marcus Halden
FinSec is a four-tier detector for financial Large Language Model (LLM) agents. It combines pattern checks, generative rollouts, semantic auditing, and weighted fusion to catch delayed, multi-turn risks. On a financial dialogue benchmark, it lifts F1 to 90.13% and cuts attack success to 9.09%, without heavy utility loss.
