2026 Archive

April 2026

Mon, Apr 20, 2026 • By Natalie Kestrel

GuardPhish shows LLMs spot phishing yet still write it

GuardPhish probes open-source Large Language Models (LLMs) running offline and finds a wide enforcement gap: models that flag phishing prompts still produce usable attacks. Voice scripts fare worst, with attack success rates of up to 98.5%. The authors propose pre-generation classifiers that reach 98.27% accuracy to curb abuse without touching the base model.

Attacks

Mon, Apr 20, 2026 • By Lydia Stratus

Jailbreak Seeds Harmful Reasoning While Answers Stay Clean

New research shows a jailbreak that injects harmful content into a model’s chain-of-thought while keeping the final answer unchanged. Using semantic triggers and psychology-based framing, it hits an average 83.6% success rate against several commercial Large Reasoning Models, exposing a blind spot for systems that surface or log reasoning traces.

Attacks

Fri, Apr 17, 2026 • By Lydia Stratus

Verifiable Gradient Inversion Breaks Federated Tabular Privacy

A new verifiable gradient inversion attack (VGIA) shows a malicious server in federated learning can provably isolate and reconstruct exact client records, including targets, from aggregated gradients. It works fast on tabular data and avoids guesswork by certifying when a single record has been isolated. Prior geometric attacks look clumsy by comparison.

April 2026

GuardPhish shows LLMs spot phishing yet still write it

Jailbreak Seeds Harmful Reasoning While Answers Stay Clean

Verifiable Gradient Inversion Breaks Federated Tabular Privacy

OpenAI adds sandbox and harness to Agents SDK

Claude Code Agent Design Exposes Real Attack Paths

Memory-led agents jailbreak VLMs via natural images

BadSkill poisons agent skills with embedded backdoors

Mosaic breaks closed VLMs with multi-view ensembles

FinSec catches delayed risks in LLM finance agents

Malicious LLM routers hijack agents via tool-call rewrites

Harmless UI overlays misdirect GUI agents

Backdoored skills let agents rebuild hidden payloads

SkillSieve triages malicious agent skills with LLM jury

OpenClaw agent state poisoning punches through safety

Covert Key Exchange Lets LLM Agents Evade Audits

Zero-knowledge proofs police risky LLM fine-tuning

Puzzle Prompts Make LLM Agents Exploit Vulnerabilities

Causal guardrails tame overzealous cyber defence agents

Agent Traps Map the Real Attack Surface for LLMs

Benchmarking OpenClaw Agents Exposes Lifecycle Security Risks

SkillAttack maps prompt-only exploits in agent skills

Environment-injected memory poisoning trips LLM web agents

LLMs Tackle Hardware Security Verification, With Evidence

Google outlines continuous defences for indirect prompt injection

AgentWatcher spots prompt injection in LLM agents

Stress tests show LLM ethics degrade under pressure

VibeGuard stops AI-generated leaks before publish

Secure AI Agents Need Dynamic Plans and Policies

Hidden visual prompts steer multimodal LLMs

March 2026

Make AI Agents Accountable on Real Machines

GrantBox tests LLM agents with real-world privileges

Why Classifier Gates Fall Short for Safe AI Upgrades

Agents cut re-ID in street images without the cloud

Survey maps multimodal LLM attacks to operational reality

OpenClaw advisories expose cross-layer RCE in LLM agents

OpenClaw advisories expose brittle AI agent controls

Real-time monitor spots LLM reasoning failures

Compound attack steers RAG via injection and poisoning

Compound attack hijacks RAG with prompt injection

Watcher-led defence hardens OpenClaw autonomous agents

Steering Vectors Upend LLM Jailbreak Resistance

Autoresearch agent finds stronger LLM jailbreak attacks

Anti-I2V disrupts deepfake image-to-video generation models

LLM agents automate ROS pentesting with graph memory

Token-aware fuzzing slashes LLM jailbreak queries

Study maps agentic AI attack surface and risks

Auditing MCP agents for over-privileged tool access

Comics jailbreak multimodal LLMs at high success rates

Finetuning Makes Aligned LLMs Regurgitate Copyrighted Books

NASimJax speeds RL pentesting but exposes brittle methods

Guidance Injection Steers LLM Agents to Harm

EvoJail automates long-tail LLM jailbreaks at scale

Joint audio-text attacks jailbreak spoken models

Confirmation Bias Lets Malicious PRs Evade LLM Review

Local LLM agent solves Linux privilege escalation

Red team shows LLM agents hide injected actions

LAAF automates multi-stage prompt injection against agents

Healthcare deployment cages LLM agents with zero trust

REFORGE breaks concept unlearning with image-based red teaming

ClawWorm shows LLM agents self-propagate across peers

LLM scanners mislead when their judges disagree

OpenClaw Case Study Exposes Real Risks in AI Agents

PISmith uses RL to break prompt-injection defences

OpenClaw PRISM hardens tool-using LLM agent runtime

LLM agents break trust boundaries; favour deterministic controls

Constrain agents, protect data, blunt prompt injection

Structured visuals quietly jailbreak LVLMs with slot filling

Framework curbs agentic LLM risks in enterprise SOC

GenAI speeds pentests of consumer robots, exposes fleets

Layered Controls Catch LLM Agent Tool Abuse

Codex Security touts end-to-end AI patching agent

Rebuilding Agent Flows to Catch LLM Attack Chains

LLM Agents Cheat When Survival Is Threatened

LLM Safety Backfires: Transferable RAG Blocking Works

Benchmark tests LLM agents for container escapes

RCR shows LLMs assist Active Directory pentests

Scores Drive Voice Impersonation on Speaker Recognition