Prompt bank separates executable malware code from knowledge
Pentesting
Benchmarks that mix “write me malware” with “explain how malware works” have been giving us mushy refusal stats. This paper fixes the measurement problem with a validated prompt bank that separates executable software requests from general security knowledge, then shows the split holds up under cross-vendor scrutiny.
The team pulled 3,133 prompts from four public benchmarks and ran them through a five-judge panel spanning Anthropic, OpenAI, Google, Zhipu AI and Alibaba model families: Claude Sonnet 4.6, GPT-5.3-Codex, Gemini 3 Flash, GLM-5, and Qwen3-Coder-Next. Each judge issued a binary label, CODE or KNOWLEDGE. A three-of-five majority locked the decision. Inter-rater reliability came out high: Fleiss’ kappa 0.876 with a 95% CI of [0.862, 0.888], and 69.3% of prompts were unanimous. After quality checks, they released 1,554 CODE prompts and a 388-prompt KNOWLEDGE comparison set. Every prompt reached the consensus threshold; none were kicked to an ambiguous bucket.
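As a concrete illustration, the adjudication step reduces to a majority count over five binary labels. The sketch below is a minimal reconstruction under that assumption, not the authors' released code; the judge names are only used to label the votes.

```python
from collections import Counter

def consensus_label(labels: list[str], threshold: int = 3) -> str:
    """Adjudicate one prompt by 3-of-5 majority over binary judge labels.

    labels -- one "CODE" or "KNOWLEDGE" label per judge.
    Returns the winning label, or "AMBIGUOUS" if no label reaches the
    threshold (possible only if a judge abstains or returns an invalid
    label; with five clean binary votes a 3-vote majority always exists,
    which is consistent with the paper's empty AMBIGUOUS bucket).
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= threshold else "AMBIGUOUS"

# Example vote, one label per panel member.
votes = {
    "claude-sonnet-4.6": "CODE",
    "gpt-5.3-codex": "CODE",
    "gemini-3-flash": "KNOWLEDGE",
    "glm-5": "CODE",
    "qwen3-coder-next": "KNOWLEDGE",
}
print(consensus_label(list(votes.values())))  # -> CODE
```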
Per-source reliability varied. MalwareBench landed perfect agreement, the harmful_behaviors slice was also strong, and CySecBench was lower. RMCBench came out degenerate: the judges' labels had no variance there, so chance-corrected agreement collapses toward zero. Coder-specialised judges tended to call more prompts CODE, and the panel still held up despite heterogeneous individual error counts, which is what you want if you plan to reuse this pipeline.
Why this matters for attackers
Most production guardrails look for surface cues: imperative phrasing and code-shaped output. If your policy lumps executable code in with knowledge, you either over-block and nuke legitimate triage help, or under-block and let weaponised snippets slide. The split here maps onto two refusal pathways that models actually use. That matters at the model endpoint, where an attacker can shape the prompt: start with a "knowledge" request that sounds like a purple-team exercise, then pivot to code in the second turn. Single-turn benchmarks miss that. If your filters only spike on explicit code-generation verbs, the first hop gets through and the second lands the payload.
From an ops angle, this bank gives red teams a standard set of CODE prompts to probe vendor policies and compare results, without the usual noise of mixed-intent corpora. It also lets defenders tell whether a refusal is driven by executable generation or by information sharing, which is useful when you are instrumenting API gateways and logging for policy drift.
Limitations are noted: all-LLM judges, English-only sources, a snapshot of four benchmarks, and single-turn prompts. The dataset is gated with responsible-use restrictions. Still, as a substrate for red teaming and cross-vendor evaluation, it is cleaner than what most of us have been using. The interesting open question is how quickly multi-turn and reformulation attacks collapse the neat CODE versus KNOWLEDGE boundary in the wild.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
🔍 ShortSpan Analysis of the Paper
Problem
This paper addresses a methodological confound in evaluations of language-model refusal on malicious-coding tasks: existing benchmarks mix requests for executable malicious software with requests for harmful security knowledge. That conflation matters because executable code and descriptive security knowledge plausibly trigger different refusal pathways in safety-aligned models, so a single refusal-rate statistic over a mixed corpus cannot isolate which behaviour a model exhibits. The work aims to operationalise a weapons-versus-knowledge axis and produce a validated prompt bank to support cleaner measurement and cross-study comparability.
Approach
The authors consolidated 17,175 raw prompts from four public benchmarks into a 3,133-prompt candidate pool via deduplication and a regex-based pre-filter for imperative, code-shaped prompts. Each candidate was independently classified as CODE (executable software request) or KNOWLEDGE (information request) by a vendor-diverse five-model panel: Claude Sonnet 4.6, GPT-5.3-Codex, Gemini 3 Flash, GLM-5, and Qwen3-Coder-Next. Labels were adjudicated by a 3-of-5 majority rule. Inter-rater reliability was measured with Fleiss' kappa and 10,000-iteration bootstrap 95% confidence intervals. Final length and metadata quality checks produced the released artefact; the classification code and labels are released under the MIT licence, while the prompt text is gated for bona fide researchers.
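A minimal sketch of that reliability measurement, assuming the ratings are collapsed into an items-by-categories count matrix; the functions, seed, and toy data below are illustrative, not the released classification code.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa over an (items x categories) matrix of rating counts.

    counts[i, j] = number of judges assigning category j to prompt i;
    every row sums to the panel size (five judges here).
    """
    n = counts.sum(axis=1)[0]                    # raters per item
    p_j = counts.sum(axis=0) / counts.sum()      # marginal category shares
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    if P_e == 1.0:           # no label variance at all (the RMCBench case):
        return float("nan")  # chance agreement is 1, so kappa is undefined
    return float((P_bar - P_e) / (1 - P_e))

def bootstrap_ci(counts: np.ndarray, iters: int = 10_000, seed: int = 0):
    """Percentile 95% CI by resampling prompts with replacement."""
    rng = np.random.default_rng(seed)
    stats = [fleiss_kappa(counts[rng.integers(0, len(counts), len(counts))])
             for _ in range(iters)]
    return np.nanpercentile(stats, [2.5, 97.5])  # skip degenerate resamples

# Toy usage: four prompts rated by five judges across {CODE, KNOWLEDGE}.
toy = np.array([[5, 0], [4, 1], [1, 4], [0, 5]])
print(fleiss_kappa(toy), bootstrap_ci(toy, iters=1000))
```

The degenerate branch is also why the RMCBench slice reports a near-zero, essentially meaningless kappa: with no label variance there is nothing for a chance-corrected statistic to measure.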
Key Findings
- The consensus pipeline produced 1,554 consensus-CODE prompts and 388 consensus-KNOWLEDGE prompts, 1,942 in total after quality passes.
- Five-model agreement is very high: Fleiss' kappa = 0.876 with 95% CI [0.862, 0.888], characterised as "almost perfect"; 69.3% of prompts were unanimous (5/5).
- No prompt failed the 3-of-5 threshold; the AMBIGUOUS bucket was empty for this dataset run.
- Per-source reliability varied: MalwareBench subset reached perfect agreement (kappa = 1.000), harmful_behaviors kappa = 0.942, CySecBench kappa = 0.775, while RMCBench showed a degenerate near-zero kappa due to no label variance.
- Judge behaviour: coder-specialised models produced higher CODE counts on average; coder-pair internal agreement was 93.7%, general-purpose judges 89.7%, and cross-family disagreement was 7.6%.
- The five-judge panel tolerated heterogeneous error rates (individual judge errors ranged from 1 to 430) without losing consensus labels, thanks to the 3-of-5 rule; the sketch below illustrates why.
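To see why the 3-of-5 rule absorbs uneven judge quality, here is a quick simulation of independent binary judges with different per-judge error probabilities. The rates are invented for illustration (the paper reports raw error counts, not rates), and the helper is not from the released code.

```python
import random

def consensus_error_rate(per_judge_error: list[float], trials: int = 100_000,
                         seed: int = 1) -> float:
    """Fraction of trials in which a 3-of-5 majority of independent
    binary judges flips the true label."""
    rng = random.Random(seed)
    flipped = 0
    for _ in range(trials):
        # Each judge votes incorrectly with its own error probability.
        wrong_votes = sum(rng.random() < p for p in per_judge_error)
        flipped += wrong_votes >= 3
    return flipped / trials

# Illustrative, deliberately uneven panel: one weak judge, four strong ones.
print(consensus_error_rate([0.22, 0.02, 0.03, 0.01, 0.05]))  # well under 1%
```

With roughly independent errors, flipping the consensus needs at least three simultaneous mistakes, so one noisy judge barely moves the labels; correlated errors across judges would weaken this guarantee, which is one reason the vendor-diverse panel matters.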
Limitations
Main constraints include reliance on an all-LLM panel with no human-annotator baseline, English-only sources, a snapshot tied to early-2026 benchmark versions, a finite and non-exhaustive judge panel composition, a deliberately conservative 3-of-5 rule that may exclude borderline prompts in other datasets, and single-turn prompts that do not capture multi-turn adversarial reformulations.
Implications
Offensive-security implications: separating executable-code requests from knowledge requests reveals that models may implement distinct refusal pathways, so attackers could exploit the conflation in mixed benchmarks to mask exploit-oriented intent. An adversary might craft prompts that resemble benign knowledge requests, or use multi-turn strategies to convert knowledge into executable artefacts, bypassing guards tuned to explicit code-generation phrasing. The validated bank can also inform attackers about which prompt constructions are more likely to be classified as executable requests versus knowledge, enabling targeted probing of guardrails. Conversely, the artefact provides a standard substrate for red-teaming and stress-testing defences across vendors, but its availability is gated and accompanied by responsible-use restrictions.