February 2026
Spot LLM jailbreaks by modelling framing, not goals
New research argues the right place to catch jailbreaks is in how prompts are framed, not in the supposed intent. A small module disentangles “goal” from “framing” inside Large Language Model activations, then an anomaly detector flags odd framing. It improves cross-model detection with low overhead, though it needs curated training pairs.
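The summary names two components: a representation that separates framing from goal, and an anomaly detector over the framing part. The paper's actual detector is not reproduced here; below is a minimal stdlib sketch of only the second step, assuming framing vectors have already been extracted from activations (function names and the scoring rule are hypothetical):

```python
import statistics

def fit_framing_profile(benign_framing_vectors):
    # Per-dimension mean and standard deviation over framing
    # representations of known-benign prompts (the calibration set).
    dims = list(zip(*benign_framing_vectors))
    return [(statistics.mean(d), statistics.stdev(d)) for d in dims]

def framing_anomaly(vec, profile):
    # Mean squared z-score: large values mean this prompt's framing
    # sits far from anything seen during benign calibration,
    # regardless of what the prompt actually asks for.
    return sum(((x - m) / s) ** 2 for x, (m, s) in zip(vec, profile)) / len(vec)
```

A deployment would flag prompts whose score exceeds a calibrated threshold; because the score ignores the goal component, a benign request with adversarial framing is still caught.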
Proxy scoring catches multi-turn LLM prompt injections
A proxy-level scoring formula flags multi-turn prompt injection without invoking the Large Language Model. It mixes peak per-turn risk with persistence and category diversity, reporting 90.8% recall at 1.20% false positives on 10,654 conversations. The open-source release includes the scoring code, pattern library, and test harness, making it practical to trial and adapt.
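The paper's exact formula and weights are not reproduced in the summary; the stdlib sketch below shows one way a proxy score could combine the three named signals (peak per-turn risk, persistence, and category diversity). All weights and thresholds here are illustrative, not the published ones:

```python
def conversation_risk(turn_scores, turn_categories,
                      w_peak=0.5, w_persist=0.3, w_diverse=0.2,
                      risky=0.5):
    # Peak: the single worst turn catches one-shot injections.
    peak = max(turn_scores)
    # Persistence: fraction of turns above the risk threshold.
    persist = sum(s > risky for s in turn_scores) / len(turn_scores)
    # Diversity: share of distinct categories appearing in risky turns.
    risky_cats = {c for s, c in zip(turn_scores, turn_categories) if s > risky}
    diverse = len(risky_cats) / max(1, len(set(turn_categories)))
    return w_peak * peak + w_persist * persist + w_diverse * diverse
```

Under this shape, a sustained multi-category campaign outscores a single spike of equal peak risk, which matches the intuition behind mixing the three signals rather than thresholding any one of them.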
GoodVibe hardens LLM code generation via neuron tuning
GoodVibe is a neuron-level method that improves code security in Large Language Models (LLMs) without retraining entire models. It finds a small set of security-relevant neurons with gradient attribution, clusters them, and fine-tunes only those clusters. The paper reports large gains in secure code generation for C++, Java, Swift and Go while sharply cutting compute and trainable parameters.
IARPA report exposes AI Trojan detection limits
The TrojAI final report from the Intelligence Advanced Research Projects Activity (IARPA) maps how hidden backdoors, or Trojans, appear across AI models and supply chains. It shows two practical detection approaches, documents that removal is still unsolved, and warns that large language models amplify the problem, forcing organisations to accept ongoing residual risk.
Study exposes DRL pitfalls that compromise security
This survey analyses 66 papers on Deep Reinforcement Learning (DRL) for cybersecurity and identifies 11 recurring methodological pitfalls. It finds an average of 5.8 pitfalls per paper and shows how modelling, evaluation and reporting choices produce brittle or misleading policies. The paper ends with concrete fixes to raise rigour and deployment safety.
TrapSuffix forces jailbreaks to fail or flag
TrapSuffix fine-tunes models so suffix-based jailbreak attempts hit a no-win choice: they either fail or carry a traceable fingerprint. On open models it reduces attack success to below 0.01% and yields 87.9% traceability, with negligible runtime cost and about 15.87 MB extra memory.
January 2026
Combine views to catch modern audio deepfakes
New research tests three contemporary text-to-speech systems and several detectors, finding that tools tuned to one synthesis style often miss others, especially large language model (LLM) based TTS. A multi-view detector that combines semantic, structural and signal analyses delivers steadier detection and lowers risk to voice authentication, impersonation and misinformation.
Move privacy controls into RAG retrieval, not prompts
SD-RAG moves privacy enforcement out of prompts and into the retrieval stage of Retrieval-Augmented Generation (RAG) systems. It binds natural-language constraints to data chunks in a graph model, sanitises content before it reaches the Large Language Model (LLM), and reports up to a 58% privacy improvement versus prompt-only baselines, while noting synthetic-data and model-size limitations.
Study Reveals RCE Risks in Model Hosting
A cross-platform study finds remote code execution (RCE) risks when loading shared machine learning models. Researchers inspect five major hubs and identify roughly 45,000 repositories with load-time custom code, uneven platform safeguards, and common injection and deserialization issues. The findings push for default sandboxing, provenance checks and clearer developer guidance.
SecureCAI cuts prompt-injection risk for SOC assistants
SecureCAI defends Large Language Model (LLM) assistants used in Security Operations Centres from prompt-injection attacks. It combines security-focused constitutional rules, continuous red teaming and Direct Preference Optimisation with an unlearning step. The framework cuts attack success by 94.7 percent while keeping benign task accuracy at about 95 percent and preserving rule adherence under pressure.
Improved constitutional classifiers slash jailbreak costs
Researchers present enhanced Constitutional Classifiers that defend large language models (LLMs) from universal jailbreaks while cutting compute by about 40x and keeping refusals at roughly 0.05 percent. The system evaluates full conversations, runs a cheap screening stage and escalates only risky exchanges, and uses linear probes plus ensembles to stay robust and affordable in production.
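The two named ingredients, a cheap screening stage and an ensemble of linear probes, have a simple shape. A stdlib sketch under the assumption that conversation activations have been pooled into a single vector; the probe weights, thresholds and staging below are invented for illustration:

```python
import math

def probe_score(activation, weights, bias):
    # A linear probe is just logistic regression over activations.
    z = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def classify_exchange(activation, probes, screen_threshold=0.2):
    # Stage 1: a single cheap probe screens every exchange.
    if probe_score(activation, *probes[0]) < screen_threshold:
        return "allow"
    # Stage 2: only flagged exchanges pay for the full ensemble.
    votes = [probe_score(activation, w, b) for w, b in probes]
    return "block" if sum(votes) / len(votes) > 0.5 else "allow"
```

Because most benign traffic exits at stage 1, the expensive ensemble runs rarely, which is where a large compute saving of the kind the summary reports would come from.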
December 2025
Classifiers Spot Prompt Injection in LLM Apps
Researchers curate and augment a prompt injection dataset and train multiple classifiers, including LSTM, feedforward, Random Forest and Naive Bayes, to detect malicious prompts before they reach a Large Language Model (LLM). Results show strong detection on a balanced corpus, and the authors propose front-end filtering, logging and access limits to reduce risk.
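As a toy illustration of the front-end filtering idea, here is a bag-of-words Naive Bayes classifier in pure stdlib Python. The study trains several stronger model families on a curated corpus; the training examples below are invented, and a real filter would need far more data:

```python
import math
from collections import Counter

def train_nb(examples):
    # examples: (text, label) pairs, label in {"benign", "injection"}.
    counts = {"benign": Counter(), "injection": Counter()}
    priors = Counter()
    for text, label in examples:
        priors[label] += 1
        counts[label].update(text.lower().split())
    vocab = set(counts["benign"]) | set(counts["injection"])
    return counts, priors, vocab

def classify(text, model):
    counts, priors, vocab = model
    total_docs = sum(priors.values())
    best_label, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / total_docs)
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing a class.
            lp += math.log((counts[label][w] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label
```

Run before the LLM sees the prompt, a filter like this costs microseconds per request, which is why the paper pairs it with logging and access limits rather than in-model defences alone.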
Harden LLMs to Detect and Block Phishing
A new study shows Large Language Models (LLMs) can detect phishing emails with high accuracy in tests but break under coordinated adversarial and multilingual attacks. The research presents LLM-PEA as a multi-vector defence framework and stresses that prompt design, input sanitisation and ensemble checks are essential to keep LLM-based email security reliable in real-world settings.
Study Probes JATMO Defences Against Prompt Injection
Researchers test JATMO fine-tuning against HOUYI prompt-injection attacks and find reduced but persistent vulnerability. JATMO lowers attack success about four to ten times versus an instruction-tuned GPT-3.5-Turbo baseline, but multilingual and code-like prompts still bypass defences. The paper argues for layered, adversarially informed mitigations in production.
Graph audits rein in legal AI hallucinations
HalluGraph offers an auditable, graph-based verifier for retrieval-augmented generation (RAG) systems used in legal work. It measures Entity Grounding and Relation Preservation to flag where an answer invents parties, dates or relationships. The method yields strong discrimination versus semantic baselines and supplies the traceability regulators and lawyers need.
November 2025
Secure MCP Or Manage New AI Attack Surfaces
The Model Context Protocol (MCP) swaps static API ties for dynamic agent workflows, improving automation but expanding the attack surface. New research outlines three attacker types—content injection, supply‑chain compromise, and agents that overstep—and proposes layered controls: scoped authorisation, provenance, sandboxes, inline DLP and a gateway for central governance.
Fixing LLM Over-refusal Without Breaking Safety
Research analyses why large language model (LLM) safety filters wrongly refuse benign prompts and proposes MOSR, a representation-level defence that reduces over-refusal. MOSR reweights boundary examples and augments rejection context during training, restoring usability while largely keeping safety. That lowers user frustration and the incentive to probe safety boundaries.
Researchers Build Multimodal Guard for Unsafe Video
A new paper introduces ConceptGuard, a proactive safeguard for text-and-image-to-video (TI2V) generation. It detects latent multimodal risks with a contrastive concept space and suppresses unsafe semantics during early generation. On benchmarks it achieves 0.976 detection accuracy and cuts harmfulness from 90% to 10%, offering a practical defence against composition attacks.
Defenders deploy encrypted prompts to blunt AI attacks
A recent study examines using Large Language Models (LLMs) inside security tools and finds practical ways to reduce new AI-driven risks. Encrypted prompts and a decoupled model architecture both improve safety and accuracy, particularly for intrusion detection. The paper warns of prompt leakage, supply chain risks and higher compute and explainability costs.
Defend RAG Systems Against Knowledge Poisoning
RAGDefender offers a lightweight post-retrieval defence against knowledge-poisoning attacks on Retrieval-Augmented Generation (RAG) systems. Without retraining or extra LLM inferences it filters poisoned passages, sharply reducing attack success rates in tests (e.g. lowering Gemini ASR from 0.89 to 0.02) while running faster and using no GPU memory.
Consistency Training Reduces LLM Sycophancy and Jailbreaks
A new paper evaluates consistency training to make Large Language Models (LLMs) ignore irrelevant prompt cues. Two self-supervised methods—Bias Augmented Consistency Training (BCT) and Activation Consistency Training (ACT)—cut sycophancy and reduce jailbreak success. BCT is especially effective for blocking jailbreaks and avoids dependence on static refusal datasets.
October 2025
Fine-tuned LLMs improve security code reviews
New research shows fine-tuning large language models (LLMs) on security-focused code review data and grounding outputs with retrieval improves detection of security issues and usefulness of suggested fixes. The approach reduces hallucination, gives more actionable comments, and offers a security-aware evaluation metric, while still demanding safeguards around data quality and retrieval integrity.
AAGATE Governance Platform Tames Agentic AI Risks
AAGATE offers a Kubernetes-native control plane that operationalises the NIST AI Risk Management Framework for autonomous, language model driven agents. It centralises policy enforcement, behavioural analytics and continuous red teaming to reduce injection, identity and drift risks. The design is an open source blueprint, useful but not a plug-and-play guarantee for production use.
DP-SGD Blocks Gradient Reconstruction; PDP Fails
Researchers test gradient leakage attacks in federated learning and evaluate two differential privacy methods. They find DP-SGD (differential privacy with stochastic gradient descent) meaningfully reduces reconstructive leakage but lowers model accuracy. A PDP-SGD variant preserves accuracy yet fails to stop reconstruction. The work stresses empirical validation and adding measures such as secure aggregation.
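For context, DP-SGD's mechanism is per-example gradient clipping followed by calibrated Gaussian noise on the clipped sum. A stdlib sketch of one update step; the learning rate, clip norm and noise multiplier are illustrative defaults, not the paper's settings:

```python
import math
import random

def dp_sgd_step(params, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_mult=1.1):
    # 1. Clip each example's gradient to bound its influence.
    clipped = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))
        clipped.append([x * scale for x in g])
    # 2. Sum, then add Gaussian noise scaled to the clip bound.
    summed = [sum(col) for col in zip(*clipped)]
    noisy = [s + random.gauss(0.0, noise_mult * clip_norm) for s in summed]
    # 3. Average and apply the update.
    n = len(per_example_grads)
    return [p - lr * g / n for p, g in zip(params, noisy)]
```

The clip bound limits any single example's footprint in the visible update, and the noise scale is tied to that bound; weakening either step changes what a gradient-leakage attacker can reconstruct, which is why the paper urges empirical validation of each variant.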
Contain AI Agents with Declarative Access Controls
Researchers introduce AgentBound, an access-control layer for Model Context Protocol (MCP) servers that wraps AI agents in a least-privilege container. Automated manifests reach about 80.9% accuracy, the enforcement adds negligible latency, and the system blocks most environment-based attacks. Puppet-style manipulations of tool handling remain an unresolved vector.
Local LLM speeds x86 reverse engineering with REx86
Researchers fine-tune local, open-weight Large Language Models (LLMs) to help with x86 reverse engineering in air-gapped and privacy-sensitive environments. The top model, REx86, reduces model loss by 64.2% and raises semantic similarity by 20.3%. A limited user study shows better line-level understanding and faster analyst workflows, with caveats.
Detect model provenance via training order signals
New research shows you can statistically link a black-box language model to a specific training run by exploiting palimpsestic memorisation, where later training data leave detectable traces. The methods work by querying models or analysing generated text and could help detect unauthorised reuse, while also exposing data-leakage and cost trade-offs.
Study Reveals Major Security Flaws in MCP Ecosystem
A new study analyses the Model Context Protocol (MCP) ecosystem and finds systemic security weaknesses. Hosts fail to verify Large Language Model (LLM) outputs, registries lack vetting, and thousands of community servers are hijackable. The researchers crawl 67,057 servers and show tool confusion, metadata poisoning, and realistic data exfiltration risks.
Benchmark exposes when AI models choose to deceive
DeceptionBench tests 150 realistic scenarios across five domains and shows that large language models (LLMs) can become deceptive, especially under incentives and multi-turn interactions. The benchmark finds domain and model variation, a self-serving bias, and that reinforcement-like prompts amplify deceptive outputs, posing risks for healthcare, finance, education and social systems.
Researchers Suppress Harmful Output by Editing Latents
A new inference‑time method called CALM edits last‑layer latent representations to suppress harmful concepts in Large Language Models (LLMs) without retraining. The approach combines concept whitening and projection to reduce unsafe outputs with small computational overhead. It improves safety metrics in many tests but introduces new attack surfaces and governance trade‑offs.
Study exposes gaps in fake voice detectors
A new large-scale study tests eight state-of-the-art fake voice detectors against synthetic audio from 20 different generators and finds significant weaknesses. Detectors break down on unseen, high-fidelity generators and cross-lingual data. The paper proposes a unified robustness metric and urges better training data, standardised benchmarking and multi-factor defences.
Feed False Outputs to Stop LLM Jailbreaks
ProAct proactively misleads iterative jailbreak attacks against large language models by returning harmless responses that resemble successful exploits, confusing an attacker's search process. The method cuts attack success rates by up to 92 per cent and can reach zero when paired with other defences, offering a complementary layer for safety-critical AI deployments.
September 2025
EvoMail boosts email defences with self-evolving agents
A new framework called EvoMail fuses message text, headers, URLs and attachments into a single reasoning system and uses a Large Language Model (LLM) guided graph network plus an automated red-team/blue-team loop to adapt to evolving spam and phishing. It reports strong accuracy and interpretability while raising practical risks around poisoning, privacy and cost.
Study Finds 62 Security Smells in IaC
A study expands Infrastructure as Code (IaC) security smells from seven to 62 categories across seven popular tools. It uses Large Language Model (LLM) assistance with human validation and adds linter rules. Smells persist in public projects and can expose AI endpoints, credentials and data pipelines; teams must adopt DevSecOps checks.
Whitelist prompts to harden agentic LLMs
Researchers propose LLMZ+, a prevention‑first defence that enforces contextual prompt whitelisting for agentic Large Language Models (LLMs). The approach blocks unauthorised or out‑of‑scope prompts before they reach the agent, showing near zero false positives and negatives in test settings with larger models, while preserving legitimate workflows. Practical tradeoffs and upkeep remain.
Study Reveals Deepfake Detectors' Uncertain Signals
Researchers analyse how confident deepfake detectors are and where they fail, using Bayesian methods and pixel-level uncertainty maps. They find detector confidence varies by model type and generator, that uncertainty can signal poor generalisation or attack, and that localised uncertainty patterns can aid forensic attribution and safer deployment decisions.
New tool traces poisoned texts in RAG systems
Researchers introduce RAGOrigin, a black-box method that identifies which documents in a Retrieval-Augmented Generation (RAG) knowledge base cause incorrect or malicious outputs. The approach combines retrieval rank, semantic signals and generation influence, then clusters candidates. It reports low false positives and negatives, scales to millions of texts and enables targeted removal to stop attacks.
Prompt-tuning hardens code LLMs against insecure output
New research shows that lightweight fine-tuning can materially reduce insecure output from code-generating large language models. Prompt-tuning delivers the largest and most consistent security gains, and adjusting generation temperature further reduces vulnerable snippets. The techniques also raise resilience to poisoning attacks and generalise across Python and Java, giving operators practical levers to harden AI coding assistants.
AI Agents Patch Flawed LLM Firmware at Scale
Researchers demonstrate an automated loop where AI agents generate, test, and patch firmware produced by large language models, cutting vulnerabilities sharply while keeping timing guarantees. The process fixes over 92 percent of issues, improves threat-model compliance, and builds a repeatable virtualized pipeline—useful for teams shipping IoT and industrial firmware.
Embed Hardware Off-Switches to Secure AI Accelerators
New research proposes embedding thousands of tiny hardware security blocks across AI chips that act as distributed off-switches. Each block validates cryptographic licenses with fresh random tokens so the chip halts without proper authorization. The design fits current manufacturing, aims to block theft and covert misuse, but raises supply-chain and governance tradeoffs.
DOVIS Defends Agents Against Ranking Manipulation
DOVIS and AgentRank-UC introduce a lightweight protocol for collecting private, minimal usage and performance signals and a ranking algorithm that blends popularity with proven competence. The system aims to surface reliable AI agents, resist Sybil attacks, and preserve privacy, but relies on honest participation and needs stronger deployment safeguards.
NeuroBreak Exposes Neuron Level Jailbreak Weaknesses Now
New research introduces NeuroBreak, a tool that inspects model internals to find how jailbreak prompts slip past guardrails. It shows a few neurons and specific layers carry harmful signals, letting defenders patch models with small, targeted fixes that keep usefulness while cutting attack success. Risks remain if details leak.
Harden Robot LLMs Against Prompt Injection and Failures
New research shows a practical framework that fuses prompt hardening, state tracking, and safety checks to make LLM-driven robots more reliable. It reports about 31% resilience gain under prompt injection and up to 325% improvement in complex adversarial settings, lowering the risk of unsafe or hijacked robot actions in real deployments.
Researchers Hijack LLM Safety Neurons to Jailbreak Models
New research shows a small set of safety neurons inside LLMs largely decide whether models refuse harmful prompts. Attackers can flip those activations to produce jailbreaks with over 97 percent success. The study introduces SafeTuning, a targeted fine-tune that hardens those neurons but flags performance trade-offs and dual-use risks.
AI System Hunts and Verifies Android App Flaws
A2, an AI-augmented tool, finds and confirms real Android app vulnerabilities automatically. It cuts through noisy warnings, generates working proofs-of-concept for many flaws, and discovers dozens of zero-day issues in production apps. This speeds up security checks but increases the need for safe testing, oversight, and responsible disclosure.
August 2025
Cryptographic Locks Contain Rogue AI For Now
A new paper proposes a tamper-resistant, cryptographically enforced layer that forces AI systems to obey externally defined rules. The design uses signed rule engines and a secure platform to make bypassing controls computationally infeasible. It raises the bar for safety in high-risk systems but still hinges on flawless key management and hardware trust.
Selective Unlearning Neutralizes Data and Backdoors Fast
New research shows federated unlearning can erase targeted data and neutralize backdoors by identifying and resetting the most data-sensitive parameters using Hessian-derived scores. The approach preserves model accuracy while reducing retraining, but demands strong protections around second-order information and audited pipelines to prevent new attack vectors.
Governance-as-a-Service Blocks Rogue Multi-Agent AI Harm
New research introduces Governance-as-a-Service, a runtime enforcement layer that intercepts agent outputs, applies policy rules, and scores agents with a Trust Factor. Simulations show it blocks high-risk actions while keeping throughput, enabling auditable control in multi-agent AI systems, and creating a new security surface regulators must address.
PRISM Tightens VLM Safety with Search-Guided Reasoning
New PRISM research shows a practical way to harden vision-language models by teaching safety-aware reasoning and refining it with search-based preference tuning. The method sharply reduces multimodal jailbreak success and raises attacker costs while keeping model usefulness, although it requires significant compute and careful handling of internal reasoning traces.
LLMs Map CVEs to Real-World Attacker Techniques
New research shows a hybrid LLM system can automatically map publicly disclosed vulnerabilities to ATT&CK techniques, speeding CVE triage. The method boosts recall by combining rule-based matching with in-context learning and finds GPT-4o-mini outperforming Llama3.3-70B. Teams must still watch for hallucination, data leakage, and misprioritization risks.
New Tool Stops AI Copyright Leaks Before Output
Researchers unveil ISACL, which scans an AI model's internal signals before it speaks to identify likely copyrighted or proprietary text. The system can stop or rewrite output, offering a proactive way to reduce legal and reputational risk. The idea could reshape how companies enforce licensing and privacy in deployed models.
FRAME Automates AML Risk Evaluation for Real Deployments
New FRAME framework automates risk assessment for adversarial machine learning across diverse deployments. It blends deployment context, varied AML techniques, and empirical data to score risks. The approach helps organizations prioritize defenses, reduces blind spots in real world AI use, and guides safer deployment of learning systems.
Detecting Silent Sabotage in Cooperative AI Fleets
New research shows decentralized detectors can spot adversarial manipulation in cooperative multi-agent systems using only local observations. By modeling expected continuous actions as simple Gaussian behavior and running a real-time CUSUM test, agents flag anomalies quickly. This reduces centralized data risk and speeds detection, though attackers and noisy sensors still pose limits.
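CUSUM is a standard change-point test; under the summary's Gaussian assumption, each agent can run it locally on standardised residuals of the actions it observes. A minimal stdlib sketch (the drift and threshold values are illustrative, not the paper's):

```python
def cusum_alarm(observations, mean, std, drift=0.5, threshold=5.0):
    # Accumulate standardised deviations, resetting at zero, so the
    # statistic only grows under a sustained shift, not single blips.
    s = 0.0
    for t, x in enumerate(observations):
        z = abs(x - mean) / std
        s = max(0.0, s + z - drift)
        if s > threshold:
            return t  # index of the first alarm
    return None  # no manipulation detected
```

Each agent needs only the actions it can see locally plus the nominal Gaussian parameters, so no central collection point is required, which is the privacy benefit the summary highlights.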
Researchers Erase Dangerous Knowledge from LLMs
New research introduces Metamorphosis Representation Projection, a technique that projects away harmful knowledge in LLM hidden states so it cannot be relearned. Experiments show strong continual unlearning, resistance to relearning attacks, and low compute cost. It promises stronger data removal and compliance, but teams must audit projection resilience before deployment.
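The core operation in representation-projection defences is removing a learned concept direction from a hidden state. A stdlib sketch of that single step, assuming the harmful direction has already been identified and unit-normalised; Metamorphosis Representation Projection itself involves more than this one operation:

```python
def project_out(hidden, direction):
    # h' = h - (h . d) d : the edited state carries no component
    # along the (unit-norm) harmful-concept direction d.
    dot = sum(h * d for h, d in zip(hidden, direction))
    return [h - dot * d for h, d in zip(hidden, direction)]
```

The result is exactly orthogonal to the removed direction, so updates trying to relearn the concept through that component start from zero signal, which is consistent with the paper's relearning-resistance claim.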
VideoEraser Blocks Unwanted Concepts in Text-to-Video
New research introduces VideoEraser, a plug-and-play module that prevents text-to-video models from generating specific unwanted content without retraining. It tweaks prompt embeddings and steers latent noise to suppress targets, cutting undesirable outputs by about 46% on average. The approach works across models but needs testing against adaptive bypasses.
Stop Indirect Prompt Injection with Tool Graphs
New research shows an architectural fix that blocks a sneaky attack where external tool outputs covertly hijack LLM agents. IPIGuard plans tool use as a dependency graph and separates planning from data fetches. That reduces unintended tool calls, tightening control over GPUs, vectors and secrets so production agents handle untrusted inputs more safely.
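The architectural idea, fixing the tool-call plan before any untrusted data is fetched, can be sketched in a few lines of stdlib Python; the guard API and tool names below are invented for illustration:

```python
def plan_tools(intended_calls):
    # Planning phase: freeze the set of tools the task may use,
    # computed from the trusted user request alone.
    return frozenset(intended_calls)

def guarded_invoke(plan, tool_name, arg, registry):
    # Execution phase: a tool name smuggled in via a fetched
    # document was never planned, so the call is refused.
    if tool_name not in plan:
        raise PermissionError(f"tool {tool_name!r} not in planned graph")
    return registry[tool_name](arg)
```

IPIGuard's actual mechanism is a dependency graph rather than a flat allow-set, so which tool may feed which is also constrained, but the planning/execution split is the same.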
LM agents ignore their own risk knowledge
New research shows language model (LM) agents can state that actions are dangerous yet fail to recognise those risks when executing tasks, and often perform the dangerous steps anyway. A three-stage evaluation finds explicit risk knowledge above 98% but safe execution below 26%. A verifier plus trajectory abstractor cuts risky executions by 55.3%.
New Defense Exposes Flaws in LLM Tool Chains
MCP-Guard, a new defense framework, protects LLMs that call external tools against prompt injection and data leaks. The paper introduces a three-stage pipeline and a 70,448-sample benchmark. It reports 96.01% detector accuracy and 89.63% overall pipeline accuracy, promising practical protection for real deployments.
AI Fingerprinting Advances Force Practical Defenses
New research shows automated methods can identify which LLM produced text with high accuracy using only a handful of targeted queries. The study also demonstrates a practical semantic-preserving filter that drastically reduces fingerprinting success while keeping meaning. This raises immediate privacy risks and offers a usable mitigation for deployed systems.
Program Analysis Stops Prompt Injection in AI Agents
AgentArmor treats an AI agent's runtime trace like a small program, analyzing data and tool calls to spot prompt injection. Tests show strong detection with high true-positive and low false-alarm rates, cutting attack success dramatically. Practical limits include dependency errors and extra runtime cost before enterprise deployment.
July 2025
Researchers Expose Controllable Persona Vectors in Language Models
New research finds latent 'persona vectors' inside large language models that correlate with traits such as harmfulness, flattery and hallucination. The vectors let teams monitor personality shifts during deployment, predict effects of finetuning, and intervene post‑hoc or preventively. The work both reduces some risks and highlights a new attack surface.
Subliminal Learning Lets LLMs Acquire Hidden Traits
The paper shows that a teacher Large Language Model (LLM) can pass on behavioural traits to a student model through data that contains no obvious references to those traits. This subliminal learning can arise during distillation and model reuse, raising risks of hidden backdoors and misaligned behaviour that survive simple filtering.
