March 2026
RCR shows LLMs assist Active Directory pentests
A replication report tests whether large language models can autonomously pentest Microsoft Active Directory in a controlled lab. The prototype often finds actionable leads and near-misses but struggles to close the loop. Heavy setup aside, the release gives teams a reproducible way to study AI-assisted internal attacks.
Jailbreak Foundry turns papers into runnable LLM attacks
Jailbreak Foundry (JBF) converts jailbreak papers into runnable modules and scores them in a fixed harness. Across 30 attacks, it reproduces reported results closely (a mean attack-success-rate gap of +0.26 percentage points), ports most attacks in under an hour, and standardises AdvBench tests across 10 models with a consistent GPT-4o judge. It also cuts bespoke code and surfaces model-specific weak spots.
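For readers who want the shape of such a harness, a minimal sketch follows; the `AttackModule` interface, judge callback and scoring loop are illustrative assumptions, not JBF's actual API.

```python
# Minimal sketch of a fixed jailbreak-scoring harness: one attack module,
# one target model, one consistent judge, one attack-success-rate number out.
# The AttackModule protocol and callback signatures are assumptions,
# not JBF's real interface.
from typing import Callable, Protocol

class AttackModule(Protocol):
    def transform(self, harmful_prompt: str) -> str: ...

def attack_success_rate(
    attack: AttackModule,
    target: Callable[[str], str],       # wraps one model endpoint
    judge: Callable[[str, str], bool],  # True if the response is harmful
    prompts: list[str],
) -> float:
    hits = 0
    for prompt in prompts:
        response = target(attack.transform(prompt))
        if judge(prompt, response):
            hits += 1
    return hits / len(prompts)
```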
February 2026
Red team uncovers LLM agent leaks, spoofing, DoS
An exploratory red-team exercise on autonomous Large Language Model (LLM) agents shows routine security failures once models gain memory, tools and chat channels. The agents leaked data, took instructions from non-owners, misreported outcomes, triggered denial of service and spread unsafe rules across peers. Identity spoofing across channels and provider-side behaviour shifts compound the risk.
Intent Laundering Breaks Cue-Driven LLM Safety
New research audits AdvBench and HarmBench and finds they overuse obvious trigger phrases. The authors strip cues while keeping malicious intent, then watch Large Language Models (LLMs) comply. Attack success jumps to around 80–87% in one pass, and 90–98.55% with a black-box loop, including on Gemini 3 Pro and Claude 3.7 Sonnet.
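The audit's first step, measuring how often giveaway phrases appear in benchmark prompts, is easy to reproduce; a minimal sketch, with a hypothetical cue list rather than the paper's inventory:

```python
# Count how many benchmark prompts contain obvious trigger phrases.
# The CUES list is a hypothetical illustration, not the paper's.
from collections import Counter

CUES = ["step-by-step guide", "how to make", "without getting caught"]

def cue_coverage(prompts: list[str]) -> tuple[float, Counter]:
    counts: Counter = Counter()
    flagged = 0
    for prompt in prompts:
        hits = [c for c in CUES if c in prompt.lower()]
        counts.update(hits)
        flagged += bool(hits)
    return flagged / len(prompts), counts  # share of cued prompts, cue tally
```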
Difficulty-aware LLM agents lift pen test success
New research dissects why Large Language Model (LLM) agents often stall in automated penetration testing. It separates fixable tooling gaps from deeper planning failures, then shows difficulty-aware planning improves end-to-end results. Reported gains include up to 91% CTF task completion and better performance on an Active Directory lab than prior systems.
Benchmark tests LLMs on secure code and fixes
SecCodeBench-V2 puts Large Language Model coding assistants through realistic secure coding tasks. It spans 98 scenarios across 22 CWE categories and five languages, using runnable proof-of-concept tests in isolated environments. Results are severity-weighted with Pass@K scoring and include an LLM judge for tricky cases, offering reproducible, comparable security evidence.
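Pass@K presumably follows the standard unbiased estimator from the original HumanEval paper; whether SecCodeBench-V2 uses exactly this form is an assumption, but the arithmetic is worth having to hand:

```python
# Unbiased pass@k (Chen et al., 2021): probability that at least one of k
# samples drawn from n generations passes, given c passing generations.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer than k failing samples: every k-subset passes
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(10, 3, 1)` returns 0.3: with three passing samples out of ten, a single draw succeeds 30% of the time.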
Cross-modal attacks outwit vision-language model defences
A new paper introduces CrossTALK, a cross-modal entanglement attack that spreads clues across images and text to bypass vision-language model defences. Experiments on nine mainstream models show high success and detailed harmful outputs, highlighting gaps in cross-modal alignment and the need for adversarial testing and cross-modal safety checks in deployed VLM systems.
Governed GenAI streamlines Wi-Fi pentesting with oversight
WiFiPenTester folds Large Language Models into wireless reconnaissance and decision support to rank targets, estimate feasibility, and suggest strategies, while keeping humans firmly in control. A Kali-based proof-of-concept logs evidence, gates model spend, and separates AI reasoning from radio actions. Gains in accuracy and efficiency come with privacy, legal, and prompt-sensitivity caveats.
December 2025
Planner-led Agents Boost Automated Penetration Testing
Researchers evaluate Large Language Model (LLM) agents for automated penetration testing and introduce CHECKMATE, which pairs an explicit classical planner with LLM executors. CHECKMATE raises benchmark success rates by over 20 percent, halves time and cost versus the best agent baseline, and improves stability while exposing remaining limits in long-horizon planning and tool use.
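The architecture is easiest to picture as a loop: a symbolic planner owns the plan, LLM executors ground one step at a time, and the planner replans whenever execution diverges. A sketch under assumed interfaces (CHECKMATE's real ones differ):

```python
# Planner-led agent loop: the classical planner decides *what* to do next;
# the LLM executor only decides *how* to do one bounded step.
# planner.plan(), executor.execute() and the result fields are assumptions.
def run_engagement(planner, executor, state: dict) -> dict:
    plan = planner.plan(state)                  # e.g. PDDL-style plan
    while plan:
        step = plan.pop(0)
        result = executor.execute(step, state)  # LLM grounds step in tools
        state = result.new_state
        if not result.ok:                       # world diverged from the model
            plan = planner.plan(state)          # replan from observed state
    return state
```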
AI agents match pen testers, expose new risks
A live comparison pits autonomous AI agents against ten human penetration testers on a large university network. The ARTEMIS multi-agent scaffold places second, finding nine valid vulnerabilities with an 82% validity rate and clear cost advantages. The study flags higher false positives, GUI automation gaps, and new dual-use concerns.
New TeleAI-Safety Benchmark Exposes LLM Jailbreak Risks
TeleAI-Safety gives security teams a modular, reproducible way to probe large language model (LLM) jailbreaks. The benchmark tests 342 attack samples across 14 models with 19 attack techniques, 29 defence methods and 19 evaluation approaches. Results show model-specific weak points, evaluator variability and the need for multi-layer, ongoing testing in live deployments.
November 2025
Study Finds Widespread Vulnerabilities in AI C/C++ Code
Researchers test ten Large Language Models (LLMs) that generate C and C++ code and find many outputs contain common, real-world vulnerabilities. Static scanners report dozens of Common Weakness Enumeration (CWE) instances, some mapping to recorded Common Vulnerabilities and Exposures (CVEs). The study urges treating AI-produced code as untrusted and adding security checks.
Benchmarks expose LLMs' weakness to authority prompts
PARROT, a new robustness framework, tests how social pressure from authoritative prompts pushes Large Language Models (LLMs) to agree with false assertions. Evaluating 22 models on 1,302 multiple-choice items, the study finds wide variance: modern systems resist persuasion, while older and smaller models often comply and grow more confident in wrong answers, creating real-world risk.
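The manipulation itself is simple to replicate: prepend a confident, false authority claim and measure how often a previously correct answer flips. Template wording and item format below are hypothetical, not PARROT's:

```python
# Measure how often authority pressure flips a correct answer to a wrong one.
# The template text and (question, correct, wrong) item format are assumed.
AUTHORITY = "As a senior professor in this field, I can confirm the answer is {wrong}. {question}"

def flip_rate(ask, items) -> float:
    """ask(prompt) -> answer string; items: (question, correct, wrong) tuples."""
    flips = 0
    for question, correct, wrong in items:
        baseline = ask(question)
        pressured = ask(AUTHORITY.format(wrong=wrong, question=question))
        flips += (baseline == correct) and (pressured == wrong)
    return flips / len(items)
```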
ForgeDAN exposes gaps in aligned LLM safeguards
ForgeDAN is an evolutionary attack framework that crafts subtle prompts to bypass safeguards in aligned Large Language Models (LLMs). The paper finds it outperforms prior methods, achieving high success on several models, and shows that simple keyword filters and shallow detectors leave an exploitable surface. The study urges layered defences and continual red-teaming.
Bad fine-tuning data breaks small language models
Researchers test 23 small language models and find that modest contamination of instruction data can wreck behaviour. Simple syntactic edits, such as reversing characters, often collapse performance; semantic corruptions can steer models toward harmful outputs once exposure passes a threshold. Larger models can be more easily hijacked, creating supply-chain risks for deployment.
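The "simple syntactic edits" are as blunt as they sound; a sketch of character-reversal contamination applied to a fraction of instruction pairs (the 5% rate and field names are assumptions):

```python
import random

def contaminate(dataset: list[dict], rate: float = 0.05, seed: int = 0) -> list[dict]:
    """Reverse the characters of the response in a small fraction of examples.
    The rate and the {'instruction', 'response'} schema are illustrative."""
    rng = random.Random(seed)
    out = []
    for example in dataset:
        example = dict(example)
        if rng.random() < rate:
            example["response"] = example["response"][::-1]  # syntactic edit
        out.append(example)
    return out
```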
Automated Multimodal Jailbreaks Reveal VLM Weaknesses
New research introduces JPRO, a black-box, multi-agent framework that automates jailbreaking of vision-language models (VLMs). It chains planning, attack, modification and verification to produce diverse image-plus-text attacks and achieves over 60% success against several advanced VLMs. The work highlights practical risks for deployed multimodal endpoints and the need for stronger defences.
Teach LLMs Security Specs to Find Bugs
Researchers introduce VulInstruct, a method that teaches Large Language Models (LLMs) explicit security specifications mined from past patches and CVEs to detect vulnerabilities. On a strict benchmark it raises F1 and recall substantially and finds many bugs that baselines miss. The approach even uncovered a real high-severity CVE, showing practical value for automated code review.
October 2025
Genesis evolves attack strategies against LLM web agents
Genesis presents an automated red-teaming framework that evolves attacks against web agents driven by large language models (LLMs). Its Attacker, Scorer and Strategist modules generate, evaluate and summarise adversarial payloads. The system finds transferable strategies, beats static baselines, and shows defenders need continuous, data-driven testing and stronger interaction controls.
HackWorld Tests AI Agents Against Web App Flaws
HackWorld evaluates computer-use agents (CUAs) on 36 real web applications and finds exploitation success below 12%. Agents can perceive pages but struggle to plan multi-step attacks, orchestrate tools and recover from errors. The study highlights gaps to close before autonomous agents become a scalable automated attack vector and points to practical mitigations.
RedTWIZ Exposes LLM Jailbreaks with Adaptive Planner
RedTWIZ is an adaptive, multi-turn red teaming framework that systematically probes Large Language Model (LLM) safety. The authors show multi-turn, goal-oriented jailbreaks can coax state-of-the-art models to produce unsafe code and explanations. Their hierarchical planner and diverse attack suite outperform naive approaches, exposing gaps in guardrails for AI-assisted software development.
AutoPentester Automates Red-Team Tasks, Reveals Gaps
AutoPentester uses a Large Language Model (LLM) agent to automate end-to-end penetration testing and yields measurable gains versus PentestGPT. The framework raises subtask completion and vulnerability coverage while cutting human interactions, but introduces automation overhead and new risks such as prompt injection and hallucination that teams must mitigate before deployment.
AI agents fuzz industrial control protocols effectively
Researchers present MALF, a multi-agent Large Language Model (LLM) fuzzing framework that finds protocol-aware faults in industrial control systems (ICS). Using Retrieval-Augmented Generation (RAG) and QLoRA tuning, MALF reports 88–92% test pass rates, broad protocol coverage, many exception triggers and three zero-days in a power-plant range, highlighting both defensive value and dual-use risk.
September 2025
MCP tool poisoning steers LLM agents at scale
This paper shows that Model Context Protocol (MCP) tools can be poisoned to steer Large Language Model (LLM) agents. An automated framework, AutoMalTool, generates malicious tools with about 85% generation success and roughly 35% real-agent effectiveness while evading detectors. The finding exposes a scalable attack surface and a gap in current defences.
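One cheap screen the finding motivates: reject MCP tools whose descriptions contain instruction-like payloads aimed at the model rather than the user. The phrase list below is an assumed illustration, not a tested detector:

```python
import re

# A tool *description* should describe; imperative, agent-directed phrasing
# inside it is a poisoning red flag. Patterns here are illustrative only.
SUSPICIOUS = [
    r"\bignore (all|any|previous|prior) instructions\b",
    r"\bdo not (tell|inform|mention|reveal)\b",
    r"\bbefore (using|calling) any other tool\b",
]

def screen_tool_description(description: str) -> list[str]:
    return [p for p in SUSPICIOUS if re.search(p, description, re.IGNORECASE)]
```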
Memory aids RL pen-testing robustness and transfer
Researchers train reinforcement learning agents to run simulated, partially observable penetration tests and compare policy variants. Augmenting observations with recent history outperforms recurrent and transformer-based policies, converging about three times faster and generalising better across network sizes. The work flags gaps in observability and urges memory-aware defences against automated attacks.
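The winning observation design amounts to a sliding window over recent observations, letting a plain feed-forward policy see short-term history without recurrence. A minimal sketch (window size and zero-padding are assumed choices):

```python
from collections import deque
import numpy as np

class HistoryStack:
    """Concatenate the last k observations into one flat vector.
    k=4 and zero-padding at episode start are illustrative choices."""
    def __init__(self, obs_dim: int, k: int = 4):
        self.obs_dim, self.k = obs_dim, k
        self.buffer: deque = deque(maxlen=k)

    def reset(self, obs: np.ndarray) -> np.ndarray:
        self.buffer.clear()
        for _ in range(self.k - 1):
            self.buffer.append(np.zeros(self.obs_dim))
        return self.step(obs)

    def step(self, obs: np.ndarray) -> np.ndarray:
        self.buffer.append(obs)
        return np.concatenate(self.buffer)
```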
Automated Red-Teaming Exposes Global AI Disinformation Gaps
A new method called anecdoctoring automates multilingual adversarial prompt generation using 9,815 fact-checked claims in English, Spanish and Hindi from the US and India. By clustering narratives and adding knowledge graphs, it raises attack success rates above 80% for several models and shows where English-centric safety testing leaves dangerous blind spots.
Ads Enable LLMs to Reconstruct User Profiles
Researchers audit social media ad streams and show they can reveal sensitive user attributes when analysed with multimodal Large Language Models. The study finds algorithmic skew in political and gambling ads and reports LLMs reconstruct gender, age and other demographics well above baseline, creating privacy and targeting risks for users and organisations.
MUSE exposes and hardens multi-turn LLM jailbreaks
MUSE is a new framework that both probes and patches multi-turn jailbreaks in conversational AI. Its attack module uses semantic strategies and Monte Carlo Tree Search to discover context-driven bypasses, and its defence fine-tunes models at the turn level to cut successful multi-turn exploits while keeping reasoning intact.
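The search side leans on standard Monte Carlo Tree Search machinery; the UCT selection rule at its core looks like this (the tuple layout is an assumption, not MUSE's data model):

```python
import math

def uct_select(children, parent_visits: int, c: float = 1.4):
    """children: list of (reward_sum, visits, node) tuples (layout assumed).
    Exploit branches that led to unsafe replies, explore rarely tried ones."""
    def score(reward_sum: float, visits: int) -> float:
        if visits == 0:
            return float("inf")  # always try unvisited branches first
        return reward_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)
    return max(children, key=lambda ch: score(ch[0], ch[1]))[2]
```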
New Benchmark Shows AI Pentesters Fail Real Targets
A new real-world benchmark and agent pair, TermiBench and TermiAgent, tests AI-driven penetration tools beyond toy capture-the-flag setups. The research shows most existing agents struggle to obtain system shells, while TermiAgent improves success with memory-focused reasoning and structured exploit packaging. This raises practical security concerns and governance questions for defenders and policy makers.
AI Powers Android Exploits and Shifts Pentesting
New research shows large language models can automate Android exploitation workflows, speeding up rooting and privilege escalation in emulated environments. The study warns these AI-generated scripts can be misused at scale, highlights emulator limits, and urges human oversight and defence-aware toolchains to prevent automation from becoming an attacker force multiplier.
Anchor LLMs with ATT&CK, Cut Pentest Hallucinations
New research shows constraining LLM-driven penetration testing to a fixed MITRE ATT&CK task tree dramatically cuts hallucinations and redundant queries while raising task completion rates across models. The method speeds automated assessments, helps smaller models succeed, and warns defenders to update mappings before attackers and tools weaponise the same guided approach.
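Operationally, "constraining to a task tree" means the model only ever chooses among the current node's children instead of free-generating next steps. A sketch with a toy tree (the encoding and choose() signature are assumptions):

```python
# Restrict an LLM pentest agent to an ATT&CK-derived task tree: at each
# step it may only pick one of the current node's children.
# The tree literal and choose() callback are illustrative assumptions.
TASK_TREE = {
    "recon": ["scan_ports", "enumerate_shares"],
    "scan_ports": ["exploit_service"],
    "enumerate_shares": ["harvest_creds"],
    "exploit_service": [],
    "harvest_creds": [],
}

def constrained_walk(choose, root: str = "recon") -> list[str]:
    """choose(node, options) -> one of options; an LLM call in practice."""
    path, node = [root], root
    while TASK_TREE[node]:
        options = TASK_TREE[node]
        picked = choose(node, options)
        if picked not in options:  # reject hallucinated steps outright
            raise ValueError(f"{picked!r} is not a permitted next task")
        path.append(picked)
        node = picked
    return path
```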
LLMs Fail to Fix Real Exploitable Bugs
New exploit-driven testing finds that popular large language models fail to reliably repair real, exploitable Python vulnerabilities. Researchers test 23 real CVEs with working proof-of-concept exploits and show top models fix only five cases. The result warns that AI patches often leave attack surfaces and need exploit-aware checks before deployment.
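The exploit-aware check the paper argues for reduces to a two-sided gate: the PoC must stop working and the functional tests must keep passing. A sketch with placeholder commands:

```python
import subprocess

def patch_is_valid(poc_cmd: list[str], test_cmd: list[str]) -> bool:
    """A repair counts only if the proof-of-concept exploit now fails AND the
    project's tests still pass. Commands (e.g. ["python", "poc.py"]) are
    placeholders; exit-code conventions are assumed."""
    exploit = subprocess.run(poc_cmd, capture_output=True)
    tests = subprocess.run(test_cmd, capture_output=True)
    return exploit.returncode != 0 and tests.returncode == 0
```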
Audit Reveals LLMs Spit Out Malicious Code
A scalable audit finds production LLMs sometimes generate code containing scam URLs even from harmless prompts. Testing four models, researchers see about 4.2 percent of programs include malicious links and identify 177 innocuous prompts that trigger harmful outputs across all models. This suggests training data poisoning is a practical, deployable risk.
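Replicating the audit's screening step is straightforward: pull URLs out of generated programs and check hosts against a reputation list (the static blocklist below stands in for a live feed):

```python
import re

URL_RE = re.compile(r"https?://[^\s\"')>]+")

def audit_generated_code(code: str, blocklist: set[str]) -> list[str]:
    """Return URLs in generated code whose host is on a known-bad list.
    A real audit would query a live reputation feed, not a static set."""
    flagged = []
    for url in URL_RE.findall(code):
        host = url.split("/")[2].lower()
        if host in blocklist:
            flagged.append(url)
    return flagged
```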
Researchers Turn AI Security Tools Into Attack Vectors
New research shows AI-powered cybersecurity tools can be hijacked through prompt injection, where malicious text becomes executable instructions. Proof-of-concept attacks compromise unprotected agents in seconds with a 91 percent success rate. Multi-layer defences can block these exploits, but researchers warn the fixes are fragile and require ongoing vigilance.
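One layer of the kind of defence the paper evaluates is to quarantine untrusted text before it reaches the instruction channel; a minimal sketch (marker list and wrapper format are assumptions):

```python
# Never splice untrusted tool output directly into the instruction channel.
# The marker list and wrapper format are illustrative assumptions.
INJECTION_MARKERS = ("ignore previous", "new instructions", "you must now")

def wrap_untrusted(text: str) -> str:
    if any(m in text.lower() for m in INJECTION_MARKERS):
        text = "[REDACTED: possible prompt injection]"
    return f"<untrusted_data>\n{text}\n</untrusted_data>"
```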
August 2025
Train Agents to Find Vulnerabilities at Scale
Researchers build CTF-Dojo and CTF-Forge, a scalable runtime and automation pipeline that trains language-model agents on containerized capture-the-flag challenges. They show small verified training sets yield big gains in exploit-finding ability, improving open models while raising clear misuse risks that force urgent, practical containment and control decisions.
Reinforcement Learning Improves Autonomous Pentest Success
New Pentest-R1 shows that combining offline expert walkthroughs with online interactive training helps smaller AI agents perform real multi-step penetration tests. The system raises success rates and cuts token use, but absolute performance stays modest. This matters for defenders who want automated, repeatable tests and for risk managers worried about misuse.
LLMs Automate Penetration Tasks, Exposing Infra Weaknesses
New research shows a modern LLM can autonomously solve most beginner capture-the-flag tasks, finding files, decoding data, and issuing network commands with human-speed accuracy. That success lowers the skills barrier for attackers and exposes specific infrastructure gaps. Operators must apply practical hardening to endpoints, GPUs, vector stores, secrets and data paths now.
