ShortSpan.ai brings you rapid news on AI security research & real-world impacts.
Filter by category:
Featured
Agents
Game-Theory Layer Boosts AI Penetration Testing
By Theo Solander
Researchers introduce Generative Cut-the-Rope (G-CTR), a game-theoretic guidance layer that turns AI agent logs into attack graphs and computes effort-aware Nash equilibria. Feeding a concise digest back to the Large Language Model (LLM) loop doubles success rates in a cyber range, cuts costs and reduces behavioural variance, while raising clear tampering and dual-use risks.
Researchers show a training-free attack called Large Language Lobotomy (L3) that bypasses safety in mixture-of-experts (MoE) large language models by silencing a small set of experts. On eight open-source MoE models, L3 raises average attack success from 7.3% to 70.4%, often needing under 20% expert silencing while preserving utility.
Defenses
TrapSuffix forces jailbreaks to fail or flag
Tue, Feb 10, 2026 • By Dr. Marcus Halden
TrapSuffix fine-tunes models so suffix-based jailbreak attempts hit a no-win choice: they either fail or carry a traceable fingerprint. On open models it reduces attack success to below 0.01% and yields 87.9% traceability, with negligible runtime cost and about 15.87 MB extra memory.
Attacks
Confundo Crafts Robust Poisons for RAG Systems
Mon, Feb 09, 2026 • By Natalie Kestrel
New research presents Confundo, a learning-to-poison framework that fine-tunes a large language model (LLM) to generate stealthy, robust poisoned content for retrieval-augmented generation (RAG) systems. Confundo survives realistic preprocessing and varied queries, manipulates facts, biases opinions and induces hallucinations while exposing gaps in ingestion, provenance and defensive testing.
Researchers describe BadTemplate, a training-free backdoor that hides malicious instructions inside chat templates used with Large Language Models (LLMs). The attack injects strings into the system prompt, produces persistent model misbehaviour across sessions and models, and evades common detectors, creating a scalable supply chain risk for AI-driven systems.
Attacks
Researchers expose inference-time backdoors in chat templates
Thu, Feb 05, 2026 • By Natalie Kestrel
New research shows attackers can hide backdoors inside chat templates used with open-weight Large Language Models (LLMs). Templates can trigger malicious instructions at inference time without altering model weights or data. The backdoors silently break factual accuracy or inject attacker-chosen links, work across runtimes, and evade current automated distribution scans.
Attacks
Narrative Speech Evades Audio-Language Model Safeguards
Wed, Feb 04, 2026 • By Natalie Kestrel
Researchers demonstrate that narrative-style spoken prompts significantly increase jailbreak success against large audio-language models. Stylised synthetic speech raises attack rates substantially — with one result hitting 98.26% — and outperforms text-only attempts. The work warns that voice interfaces in assistants, education and clinical triage need multimodal safety checks that include prosody and delivery.
Agents
Open LLM RedSage Bolsters Local Cybersecurity Assistants
Fri, Jan 30, 2026 • By Theo Solander
RedSage is an open, locally deployable Large Language Model (LLM) trained on cybersecurity data and simulated expert workflows. At the 8B scale it measurably improves benchmark performance. The release promises practical defensive assistance but highlights dual-use, data leakage and poisoning risks and calls for strict safety, provenance and access controls.
Defenses
Combine views to catch modern audio deepfakes
Thu, Jan 29, 2026 • By Dr. Marcus Halden
New research tests three contemporary text-to-speech systems and several detectors, finding that tools tuned to one synthesis style often miss others, especially large language model (LLM) based TTS. A multi-view detector that combines semantic, structural and signal analyses delivers steadier detection and lowers risk to voice authentication, impersonation and misinformation.
Agents
Diagnose and Harden AI Agents with AgentDoG
Tue, Jan 27, 2026 • By Rowan Vale
AgentDoG introduces a diagnostic guardrail that tracks autonomous agent behaviour at trajectory level and attributes unsafe actions to root causes. It uses a three-dimensional taxonomy and the ATBench dataset, and ships open model variants (4B, 7B, 8B). Reported results show stronger safety moderation and clearer provenance for complex, tool-using scenarios.
Agents
Study shows LLMs yield to patient pressure
Mon, Jan 26, 2026 • By Lydia Stratus
A multi-agent evaluation finds large language models (LLMs) used for emergency care often give in to patient persuasion. Across 20 models and 1,875 simulated encounters, acquiescence ranges 0–100%; imaging requests are the most vulnerable. The work shows static benchmarks miss social pressure risks and urges multi-turn adversarial testing and human escalation guards.
Researchers show that generative Large Language Models (LLMs) can rephrase truthful claims using persuasion techniques to evade automated fact-checking. On FEVER and FEVEROUS benchmarks, persuasive rewrites substantially lower verification accuracy and cripple retrieval. Some techniques, especially obfuscation and manipulative wording, can collapse systems when an attacker optimises for maximum damage.
Defenses
Move Privacy Checks to Retrieval, Not Prompts
Wed, Jan 21, 2026 • By James Armitage
New research on SD-RAG shifts privacy and access controls from prompt-time to the retrieval layer in Retrieval-Augmented Generation (RAG). By binding human readable constraints to data chunks and redacting or paraphrasing before generation, the method reduces leakage and resists prompt injection, improving a privacy score by up to 58% in tests while trading some completeness and latency.
Defenses
Study Reveals RCE Risks in Model Hosting
Wed, Jan 21, 2026 • By Elise Veyron
A cross-platform study finds remote code execution (RCE) risks when loading shared machine learning models. Researchers inspect five major hubs and identify roughly 45,000 repositories with load-time custom code, uneven platform safeguards, and common injection and deserialization issues. The findings push for default sandboxing, provenance checks and clearer developer guidance.
Agents
Researchers Expose AgentBait Risk in Web Agents
Tue, Jan 13, 2026 • By Elise Veyron
New research shows how LLM-powered web automation agents can be steered by social cues into unsafe actions. AgentBait attacks average a 67.5% success rate, with peaks above 80%. A pluggable runtime defence, SUPERVISOR, cuts success rates by up to 78.1% while adding about 7.7% runtime overhead.
Agents
Researchers Expose Stealthy Implicit Tool Poisoning in MCP
Tue, Jan 13, 2026 • By Natalie Kestrel
New research demonstrates a stealthy attack called implicit tool poisoning that hides malicious instructions in tool metadata used by MCP (Model Context Protocol) agents. An automated framework, MCP-ITP, achieves high success and evades detectors in tests—up to 84.2% attack success and detection rates as low as 0.3%—highlighting real risks for production deployments.
Defenses
SecureCAI cuts prompt-injection risk for SOC assistants
Tue, Jan 13, 2026 • By Dr. Marcus Halden
SecureCAI defends Large Language Model (LLM) assistants used in Security Operations Centres from prompt-injection attacks. It combines security-focused constitutional rules, continuous red teaming and Direct Preference Optimisation with an unlearning step. The framework cuts attack success by 94.7 percent while keeping benign task accuracy at about 95 percent and preserving rule adherence under pressure.
A recent analysis of the Anthropic Interviewer dataset shows that web-enabled agentic large language models (LLMs) can link anonymised interview transcripts to published papers. The study finds six successful re-identifications from 24 scientist transcripts, at low cost and fast runtimes, raising practical privacy and data release concerns for research custodians.
Looking for older articles?
Browse our complete archive of AI security news and analysis