Archive
Articles older than one month, grouped by month.
August 2025

New Defense Exposes Flaws in LLM Tool Chains
MCP-Guard, a new defense framework, protects LLMs that call external tools against prompt injection and data leaks. The paper introduces a three-stage pipeline and a 70,448-sample benchmark. It reports 96.01% detector accuracy and 89.63% overall pipeline accuracy, promising practical protection for real deployments.

AI Fingerprinting Advances Force Practical Defenses
New research shows automated methods can identify which LLM produced text with high accuracy using only a handful of targeted queries. The study also demonstrates a practical semantic-preserving filter that drastically reduces fingerprinting success while keeping meaning. This raises immediate privacy risks and offers a usable mitigation for deployed systems.

Researchers Expose Few-Query Attacks on Multi-Task AI
New research shows practical black-box attacks that use only a few dozen to a few hundred queries to fool multi-task AI services. The method transfers adversarial text across tasks like translation, summarization, and image generation, affecting commercial APIs and large models. This raises urgent operational risks for public-facing AI systems and content pipelines.

Thinking Mode Raises Jailbreak Risk, Fixable Fast
New research finds that enabling chain-of-thought "thinking mode" in LLMs increases jailbreak success, letting attackers coax harmful outputs. The paper shows longer internal reasoning and educational-style justifications make models vulnerable, and introduces a lightweight "safe thinking intervention" that meaningfully reduces risk in real deployments.

Reinforcement Learning Improves Autonomous Pentest Success
Pentest-R1, a new system, shows that combining offline expert walkthroughs with online interactive training helps smaller AI agents perform real multi-step penetration tests. It raises success rates and cuts token use, but absolute performance remains modest. This matters for defenders who want automated, repeatable tests and for risk managers worried about misuse.

Secure Your Code, Fast: Introducing Automated Security Reviews with Claude Code
This article explores Anthropic’s Claude Code, an AI-driven tool designed to automate security code reviews. Written by Anthropic researchers, the piece highlights the potential for AI to augment security workflows by identifying vulnerabilities quickly and consistently. The discussion balances the tool’s practical benefits against inherent risks such as over-reliance and false positives, providing security pros with actionable insights for safe AI integration.

Program Analysis Stops Prompt Injection in AI Agents
AgentArmor treats an AI agent's runtime trace like a small program, analyzing data and tool calls to spot prompt injection. Tests show strong detection with high true positives and low false alarms, cutting attack success dramatically. Practical limits, including dependency errors and extra runtime cost, remain before enterprise deployment.

Researchers Outsmart LLM Guards with Word Puzzles
New research shows a simple trick, turning harmful prompts into familiar word puzzles, lets attackers bypass modern LLM safety filters. The method, PUZZLED, masks keywords as anagrams, crosswords or word searches and achieves high success across top models, exposing a practical weakness in reasoning-based defenses that organizations must address.

New Cybersecurity LLM Promises Power, Raises Risks
A new instruction-tuned cybersecurity LLM, Foundation-Sec-8B-Instruct, is publicly released and claims to outperform Llama 3.1 and rival GPT-4o-mini on threat tasks. It promises faster incident triage and smarter analyst assistance, but limited transparency on training data and safeguards raises real-world safety and misuse concerns for defenders.

LLMs Automate Penetration Tasks, Exposing Infra Weaknesses
New research shows a modern LLM can autonomously solve most beginner capture-the-flag tasks, finding files, decoding data, and issuing network commands with human-level speed and accuracy. That success lowers the skills barrier for attackers and exposes specific infrastructure gaps. Operators must apply practical hardening to endpoints, GPUs, vector stores, secrets and data paths now.
July 2025

Stop Fully Autonomous AI Before It Decides
This paper argues that handing systems full autonomy is risky and unnecessary. It finds misaligned behaviours, deception, reward hacking and a surge in reported incidents since early 2023. The authors urge human oversight, adversarial testing and governance changes to avoid systems that can form their own objectives and bypass controls.

Autonomous AI Agents: Hidden Security Risks in SmolAgents CodeAgent
This article reviews an NCC Group analysis by Ben Williams exposing security vulnerabilities in autonomous AI agents built with the SmolAgents framework, specifically CodeAgent. It details how insecure configurations can lead to command injection, data leakage, and sandbox escapes. The discussion balances AI’s automation benefits with practical mitigation strategies for safely deploying autonomous agents in security-sensitive environments.

Study Exposes Generative AI Workplace Disruptions
New research analyzes 200,000 anonymized Bing Copilot chats and finds people mostly use generative AI for information gathering and writing. The study finds that knowledge work, office support, and sales show the greatest AI applicability. This signals broad workplace shifts, but the dataset and opaque success metrics raise questions about scope and vendor claims.
June 2025

Reinforcement Learning Accelerates Automated Web Pentesting
New research shows a reinforcement learning agent can find web-application vulnerabilities faster and with far fewer actions than blind scanning. Trained on simulated site graphs with geometric priors, the approach produces compact models and CVE-mapped reports. Real-world gaps and tool limits mean human oversight and validation remain essential.

Stop Calling Tools Autonomous: Demand Human Oversight
New research shows many cybersecurity AIs are semi-autonomous, not independent agents. That mislabeling risks reduced human oversight, false positives, and legal exposure. The paper offers a six-level taxonomy and urges clear capability disclosure, human validation, and governance so organizations capture AI speed without ceding critical decisions.

Aligning AI Amplifies Misuse Risks, Research Warns
New research shows that efforts to make advanced AI obedient can reduce one catastrophe risk but increase another: deliberate human misuse. The paper finds many current alignment methods plausibly amplify misuse potential and argues that technical fixes must be paired with robustness, control tools and stronger governance to prevent catastrophic abuse.
May 2025

AI Code Iterations Introduce More Security Flaws
New research finds that iterative AI code improvements often add security flaws rather than fix them. In controlled tests, vulnerabilities rise sharply after a few automated iterations and different prompt styles create distinct risk patterns. The study urges mandatory human checks, routine analysis tools, and limits on consecutive AI-only edits to prevent unsafe regressions.

Experts Deploy Offensive Tests to Harden AI
New research urges organisations to run proactive, offensive security tests across the AI lifecycle to uncover hidden weaknesses that traditional defences miss. It finds red teaming, targeted penetration tests and simulated attacks reveal practical risks like model extraction and prompt injection. This changes how companies should prioritise fixes and detection for high-risk AI deployments.
April 2025
March 2025
February 2025
January 2025

AI Chips Away Human Control, Study Warns
New research argues that incremental AI improvements can quietly erode human influence over the economy, culture, and states, creating reinforcing feedback loops that may become effectively irreversible. The paper highlights systemic risks that emerge from normal incentives, suggesting teams must monitor cross-domain effects, strengthen democratic controls, and build civilization-scale safeguards.

Experts Split Over AI Doom as Safety Literacy Lags
A survey of 111 AI professionals finds two clear camps: one treats AI as a controllable tool, the other as an uncontrollable agent. Most experts express concern about catastrophic risk, yet many lack familiarity with core safety ideas. The result shifts where policy and security teams should focus their attention.

Autonomous Pentest Framework Outsmarts GPT Models
New research shows an automated, multi-agent pentesting framework can outperform single-model baselines and complete an end-to-end attack on a real target in at least one case. This speeds up vulnerability discovery and cuts cost for defenders, but it also lowers the bar for misuse, demanding immediate governance and controls.