Prompt bank separates executable malware code from knowledge
Pentesting
Benchmarks that mix “write me malware” with “explain how malware works” have been giving us mushy refusal stats. This paper fixes the measurement problem with a validated prompt bank that separates executable software requests from general security knowledge, then shows the split holds up under cross-vendor scrutiny.
The team pulled 3,133 prompts from four public benchmarks and ran them through a five-judge panel spanning Anthropic, OpenAI, Google, Zhipu AI and Alibaba model families: Claude Sonnet 4.6, GPT-5.3-Codex, Gemini 3 Flash, GLM-5, and Qwen3-Coder-Next. Each judge issued a binary label, CODE or KNOWLEDGE. A three-of-five majority locked the decision. Inter-rater reliability came out high: Fleiss’ kappa 0.876 with a 95% CI of [0.862, 0.888], and 69.3% of prompts were unanimous. After quality checks, they released 1,554 CODE prompts and a 388-prompt KNOWLEDGE comparison set. Every prompt reached the consensus threshold; none were kicked to an ambiguous bucket.
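As a concrete illustration, the adjudication step reduces to a majority count over five binary labels. The sketch below is a minimal reconstruction under that assumption, not the authors' released code; the judge names are only used to label the votes.

```python
from collections import Counter

def consensus_label(labels: list[str], threshold: int = 3) -> str:
    """Adjudicate one prompt by 3-of-5 majority over binary judge labels.

    labels -- one "CODE" or "KNOWLEDGE" label per judge.
    Returns the winning label, or "AMBIGUOUS" if no label reaches the
    threshold (possible only if a judge abstains or returns an invalid
    label; with five clean binary votes a 3-vote majority always exists,
    which is consistent with the paper's empty AMBIGUOUS bucket).
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= threshold else "AMBIGUOUS"

# Example vote, one label per panel member.
votes = {
    "claude-sonnet-4.6": "CODE",
    "gpt-5.3-codex": "CODE",
    "gemini-3-flash": "KNOWLEDGE",
    "glm-5": "CODE",
    "qwen3-coder-next": "KNOWLEDGE",
}
print(consensus_label(list(votes.values())))  # -> CODE
```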
Per-source reliability varied. MalwareBench landed perfect agreement, the harmful_behaviors slice was also strong, and CySecBench was lower. RMCBench came out degenerate: the judges' labels had no variance there, so chance-corrected agreement collapses toward zero. Coder-specialised judges tended to call more prompts CODE, and the panel still held up despite heterogeneous individual error counts, which is what you want if you plan to reuse this pipeline.
Why this matters for attackers
Most production guardrails look for surface cues: imperative phrasing and code-shaped output. If your policy lumps executable code in with knowledge, you either over-block and nuke legitimate triage help, or under-block and let weaponised snippets slide. The split here maps onto two refusal pathways that models actually use. That matters at the model endpoint, where an attacker can shape the prompt: start with a "knowledge" request that sounds like a purple-team exercise, then pivot to code in the second turn. Single-turn benchmarks miss that. If your filters only spike on explicit code-generation verbs, the first hop gets through and the second lands the payload.
From an ops angle, this bank gives red teams a standard set of CODE prompts to probe vendor policies and compare results, without the usual noise of mixed-intent corpora. It also lets defenders tell whether a refusal is driven by executable generation or by information sharing, which is useful when you are instrumenting API gateways and logging for policy drift.
Limitations are noted: all-LLM judges, English-only sources, a snapshot of four benchmarks, and single-turn prompts. The dataset is gated with responsible-use restrictions. Still, as a substrate for red teaming and cross-vendor evaluation, it is cleaner than what most of us have been using. The interesting open question is how quickly multi-turn and reformulation attacks collapse the neat CODE versus KNOWLEDGE boundary in the wild.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
🔍 ShortSpan Analysis of the Paper
Problem
This paper addresses a methodological confound in evaluations of language-model refusal on malicious-coding tasks: existing benchmarks mix requests for executable malicious software with requests for harmful security knowledge. That conflation matters because executable code and descriptive security knowledge plausibly trigger different refusal pathways in safety-aligned models, so a single refusal-rate statistic over a mixed corpus cannot isolate which behaviour a model exhibits. The work aims to operationalise a weapons-versus-knowledge axis and produce a validated prompt bank to support cleaner measurement and cross-study comparability.
Approach
The authors consolidated 17,175 raw prompts from four public benchmarks into a 3,133-prompt candidate pool via deduplication and a regex-based pre-filter for imperative, code-shaped prompts. Each candidate was independently classified as CODE (executable software request) or KNOWLEDGE (information request) by a vendor-diverse five-model panel: Claude Sonnet 4.6, GPT-5.3-Codex, Gemini 3 Flash, GLM-5, and Qwen3-Coder-Next. Labels were adjudicated by a 3-of-5 majority rule. Inter-rater reliability was measured with Fleiss' kappa and 10,000-iteration bootstrap 95% confidence intervals. Final length and metadata quality checks produced the released artefact; the classification code and labels are released under the MIT licence, while the prompt text is gated for bona fide researchers.
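A minimal sketch of that reliability measurement, assuming the ratings are collapsed into an items-by-categories count matrix; the functions, seed, and toy data below are illustrative, not the released classification code.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa over an (items x categories) matrix of rating counts.

    counts[i, j] = number of judges assigning category j to prompt i;
    every row sums to the panel size (five judges here).
    """
    n = counts.sum(axis=1)[0]                    # raters per item
    p_j = counts.sum(axis=0) / counts.sum()      # marginal category shares
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    if P_e == 1.0:           # no label variance at all (the RMCBench case):
        return float("nan")  # chance agreement is 1, so kappa is undefined
    return float((P_bar - P_e) / (1 - P_e))

def bootstrap_ci(counts: np.ndarray, iters: int = 10_000, seed: int = 0):
    """Percentile 95% CI by resampling prompts with replacement."""
    rng = np.random.default_rng(seed)
    stats = [fleiss_kappa(counts[rng.integers(0, len(counts), len(counts))])
             for _ in range(iters)]
    return np.nanpercentile(stats, [2.5, 97.5])  # skip degenerate resamples

# Toy usage: four prompts rated by five judges across {CODE, KNOWLEDGE}.
toy = np.array([[5, 0], [4, 1], [1, 4], [0, 5]])
print(fleiss_kappa(toy), bootstrap_ci(toy, iters=1000))
```

The degenerate branch is also why the RMCBench slice reports a near-zero, essentially meaningless kappa: with no label variance there is nothing for a chance-corrected statistic to measure.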
Key Findings
- The consensus pipeline produced 1,554 consensus-CODE prompts and 388 consensus-KNOWLEDGE prompts, 1,942 in total after quality passes.
- Five-model agreement is very high: Fleiss' kappa = 0.876 with 95% CI [0.862, 0.888], characterised as "almost perfect"; 69.3% of prompts were unanimous (5/5).
- No prompt failed the 3-of-5 threshold; the AMBIGUOUS bucket was empty for this dataset run.
- Per-source reliability varied: MalwareBench subset reached perfect agreement (kappa = 1.000), harmful_behaviors kappa = 0.942, CySecBench kappa = 0.775, while RMCBench showed a degenerate near-zero kappa due to no label variance.
- Judge behaviour: coder-specialised models produced higher CODE counts on average; coder-pair internal agreement was 93.7%, general-purpose judges 89.7%, and cross-family disagreement was 7.6%.
- The five-judge panel tolerated heterogeneous error rates (individual judge errors ranged from 1 to 430) without losing consensus labels, thanks to the 3-of-5 rule; the sketch below illustrates why.
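To see why the 3-of-5 rule absorbs uneven judge quality, here is a quick simulation of independent binary judges with different per-judge error probabilities. The rates are invented for illustration (the paper reports raw error counts, not rates), and the helper is not from the released code.

```python
import random

def consensus_error_rate(per_judge_error: list[float], trials: int = 100_000,
                         seed: int = 1) -> float:
    """Fraction of trials in which a 3-of-5 majority of independent
    binary judges flips the true label."""
    rng = random.Random(seed)
    flipped = 0
    for _ in range(trials):
        # Each judge votes incorrectly with its own error probability.
        wrong_votes = sum(rng.random() < p for p in per_judge_error)
        flipped += wrong_votes >= 3
    return flipped / trials

# Illustrative, deliberately uneven panel: one weak judge, four strong ones.
print(consensus_error_rate([0.22, 0.02, 0.03, 0.01, 0.05]))  # well under 1%
```

With roughly independent errors, flipping the consensus needs at least three simultaneous mistakes, so one noisy judge barely moves the labels; correlated errors across judges would weaken this guarantee, which is one reason the vendor-diverse panel matters.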
Limitations
Main constraints include reliance on an all-LLM panel with no human-annotator baseline, English-only sources, a snapshot tied to early-2026 benchmark versions, a finite and non-exhaustive judge panel composition, a deliberately conservative 3-of-5 rule that may exclude borderline prompts in other datasets, and single-turn prompts that do not capture multi-turn adversarial reformulations.
Implications
Offensive-security implications: separating executable-code requests from knowledge requests reveals that models may implement distinct refusal pathways, so attackers could exploit the conflation in mixed benchmarks to mask exploit-oriented intent. An adversary might craft prompts that resemble benign knowledge requests, or use multi-turn strategies to convert knowledge into executable artefacts, bypassing guards tuned to explicit code-generation phrasing. The validated bank can also inform attackers about which prompt constructions are more likely to be classified as executable requests versus knowledge, enabling targeted probing of guardrails. Conversely, the artefact provides a standard substrate for red-teaming and stress-testing defences across vendors, but its availability is gated and accompanied by responsible-use restrictions.