
Benchmark tests LLMs on secure code and fixes

Pentesting
Published: Thu, Feb 19, 2026 • By Theo Solander
SecCodeBench-V2 puts Large Language Model coding assistants through realistic secure coding tasks. It spans 98 scenarios across 22 CWE categories and five languages, using runnable proof-of-concept tests in isolated environments. Results are severity-weighted with Pass@K scoring and include an LLM judge for tricky cases, offering reproducible, comparable security evidence.

Security teams keep asking the same grounded question about coding assistants powered by Large Language Models (LLMs): do they actually produce secure code when it matters, inside real projects with real constraints? A new benchmark, SecCodeBench-V2, tries to answer that in a way defenders can use rather than admire from afar.

The authors assemble 98 scenarios drawn from industrial cases, mapped to 22 Common Weakness Enumeration categories across Java, C, Python, Go and Node.js. Each case is scoped at the function level but ships with a full project scaffold, fixed interfaces and dependencies, so the model cannot wave away edge cases with pseudocode. Every scenario includes executable proof-of-concept tests for both behaviour and security. The harness compiles and runs model outputs in isolated containers, checks functionality first, then fires the security PoCs. If a candidate fails to build or fails a test, the pipeline can feed the errors back and retry up to three times, echoing the dull but real debugging loop engineers live with.
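To make the shape of that loop concrete, here is a minimal Python sketch of a functionality-then-security pipeline with retries. The helper names (build_and_run, run_functional_tests, run_security_pocs) and the model interface are hypothetical stand-ins for the benchmark's actual machinery, not its API.

```python
from dataclasses import dataclass

# Illustrative sketch of the functionality-then-security loop with retries.
# build_and_run, run_functional_tests, run_security_pocs and model.generate
# are hypothetical placeholders for the benchmark's container and test machinery.

@dataclass
class StepResult:
    ok: bool
    errors: str = ""

MAX_RETRIES = 3  # the harness allows up to three repair attempts

def evaluate_candidate(scenario, model) -> dict:
    feedback = None
    for attempt in range(1, MAX_RETRIES + 1):
        code = model.generate(scenario.prompt, feedback=feedback)
        build = build_and_run(scenario.scaffold, code)   # compile/run in an isolated container
        if not build.ok:
            feedback = build.errors                      # feed build errors back, retry
            continue
        functional = run_functional_tests(build)
        if not functional.ok:
            feedback = functional.errors                 # feed failing tests back, retry
            continue
        security = run_security_pocs(build)              # PoCs fire only once functionality holds
        return {"functional": True, "secure": security.ok, "attempts": attempt}
    return {"functional": False, "secure": False, "attempts": MAX_RETRIES}
```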

For vulnerabilities where a crisp oracle is hard to encode — think semantics-heavy issues like weak cryptography or hard-coded credentials — the framework adds an LLM-as-judge panel and uses majority voting. It supports four prompt settings per case: generate, generate-with-hints, fix, and fix-with-hints. To make scores comparable, it uses a Pass@K protocol with multi-round sampling (default ten rounds). Results are severity-aware: Medium counts as 1.0, High as 2.0 and Critical as 4.0. Scenario weighting favours the native generate/fix settings at 4.0, with hint-augmented variants at 1.0. The outputs include weighted and unweighted summaries, per-language and per-scenario breakdowns, and detailed traces, which is exactly what you want when a red teamer asks “show your working”.
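The severity and scenario weights above are enough to sketch the aggregation. The snippet below assumes a simple weighted average of per-scenario pass rates; the paper's exact formula may differ, but the weights are the ones it reports.

```python
# Sketch of severity- and scenario-weighted aggregation, assuming a plain
# weighted average of per-scenario pass rates; the report's exact aggregation
# may differ in detail.

SEVERITY_WEIGHT = {"Medium": 1.0, "High": 2.0, "Critical": 4.0}
SCENARIO_WEIGHT = {"gen": 4.0, "fix": 4.0, "gen-hints": 1.0, "fix-hints": 1.0}

def weighted_score(results):
    """results: list of dicts with keys 'severity', 'setting', 'pass_rate' (0..1)."""
    num = den = 0.0
    for r in results:
        w = SEVERITY_WEIGHT[r["severity"]] * SCENARIO_WEIGHT[r["setting"]]
        num += w * r["pass_rate"]
        den += w
    return num / den if den else 0.0

example = [
    {"severity": "Critical", "setting": "gen", "pass_rate": 0.6},
    {"severity": "Medium", "setting": "fix-hints", "pass_rate": 0.9},
]
print(round(weighted_score(example), 3))  # 0.618: the Critical native result dominates
```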

If this all sounds familiar, that is because it rhymes with earlier eras. In the early days, we measured software quality with tidy microbenchmarks and static checklists, then discovered real workloads behave like weather, not clockwork. Security followed the same arc: scanners gave comfort; exploit-driven tests changed minds. SecCodeBench-V2 sits in the latter camp. It does not ask the model whether code is safe; it runs the code and tries to break it. There is a whiff of the old lab bench in the way it corrals languages, CWEs and test harnesses, but the spirit is practical: can this assistant ship a secure function here, today, under these constraints?

There are caveats. Dynamic PoCs give strong evidence within the benchmark’s threat model, but they cannot prove the absence of other flaws the tests do not exercise. Some verdicts rely on LLM judgement, which adds a useful semantic signal but is, by nature, non-deterministic. Still, the combination of execution-driven checks, severity-weighted scoring and reproducible artefacts creates a shared baseline that teams can audit and discuss without hand-waving.

For practitioners trialling LLM copilots, this provides a way to evaluate security posture across languages and weakness classes, and to spot recurring failure modes. You can compare the native generate and fix settings, study where hints help, and budget risk by severity rather than by raw pass rate. It is not the final word on secure AI-assisted development, but it is a sturdier yardstick than most. History suggests that once the community rallies around such yardsticks, tools and practices improve faster. The present work gives that rallying point something solid to push against.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

SecCodeBench-V2 Technical Report

Authors: Longfei Chen, Ji Zhao, Lanxiao Cui, Tong Su, Xingbo Pan, Ziyang Li, Yongxing Wu, Qijiang Cao, Qiyao Cai, Jing Zhang, Yuandong Ni, Junyao He, Zeyu Zhang, Chao Ge, Xuhuai Lu, Zeyu Gao, Yuxin Cui, Weisen Chen, Yuxuan Peng, Shengping Wang, Qi Li, Yukai Huang, Yukun Liu, Tuo Zhou, Terry Yue Zhuo, Junyang Lin, and Chao Zhang
We introduce SecCodeBench-V2, a publicly released benchmark for evaluating Large Language Model (LLM) copilots' capabilities of generating secure code. SecCodeBench-V2 comprises 98 generation and fix scenarios derived from Alibaba Group's industrial productions, where the underlying security issues span 22 common CWE (Common Weakness Enumeration) categories across five programming languages: Java, C, Python, Go, and Node.js. SecCodeBench-V2 adopts a function-level task formulation: each scenario provides a complete project scaffold and requires the model to implement or patch a designated target function under fixed interfaces and dependencies. For each scenario, SecCodeBench-V2 provides executable proof-of-concept (PoC) test cases for both functional validation and security verification. All test cases are authored and double-reviewed by security experts, ensuring high fidelity, broad coverage, and reliable ground truth. Beyond the benchmark itself, we build a unified evaluation pipeline that assesses models primarily via dynamic execution. For most scenarios, we compile and run model-generated artifacts in isolated environments and execute PoC test cases to validate both functional correctness and security properties. For scenarios where security issues cannot be adjudicated with deterministic test cases, we additionally employ an LLM-as-a-judge oracle. To summarize performance across heterogeneous scenarios and difficulty levels, we design a Pass@K-based scoring protocol with principled aggregation over scenarios and severity, enabling holistic and comparable evaluation across models. Overall, SecCodeBench-V2 provides a rigorous and reproducible foundation for assessing the security posture of AI coding assistants, with results and artifacts released at https://alibaba.github.io/sec-code-bench. The benchmark is publicly available at https://github.com/alibaba/sec-code-bench.

🔍 ShortSpan Analysis of the Paper

Problem

The paper introduces SecCodeBench-V2, a benchmark designed to measure how well Large Language Model coding assistants generate and repair secure code in realistic engineering settings. It addresses gaps in existing evaluations that rely on short synthetic snippets, static analysis, or coarse scoring, and that often lack realistic project context, executable verification, multi-language coverage and severity-aware aggregation.

Approach

SecCodeBench-V2 comprises 98 scenarios drawn from de-identified industrial vulnerability cases covering 22 CWE categories across five languages: Java, C, Python, Go and Node.js. Each scenario is a function-level task that supplies a complete project scaffold and requires implementing or patching a target function under fixed interfaces and dependencies. Security experts author and double-review both functional and security proof-of-concept tests. Evaluation is execution-driven: model outputs are compiled and run inside Docker-isolated validators that enforce a functionality-then-security protocol. For semantics-heavy cases where deterministic tests are insufficient, an LLM-as-a-judge panel provides complementary assessments via majority voting. The benchmark supports four prompt scenarios per case: gen, gen-hints, fix and fix-hints. A Pass@K scoring protocol (Pass@1 by default) with multi-round evaluation (default R=10) and principled two-dimensional weighting aggregates results across scenarios and severity levels. Severity weights map Medium to 1.0, High to 2.0 and Critical to 4.0; scenario weights prioritise native gen/fix settings at 4.0 and hint-augmented variants at 1.0. The framework is modular, configurable and publicly released with pipeline artefacts and logs to support reproducibility.
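The report defaults to Pass@1 over R=10 rounds. It does not reproduce the estimator here, but the standard unbiased Pass@K formula from code-generation benchmarks is the usual reference point; treat the snippet below as that convention rather than the paper's exact definition.

```python
from math import comb

# Standard unbiased Pass@K estimator (as popularised by code-generation
# benchmarks); whether SecCodeBench-V2 uses exactly this form is an assumption.

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples drawn, c = samples that pass, k = budget being scored."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With Pass@1 over R=10 rounds this reduces to the plain pass rate c/n.
print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.3
```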

Key Findings

  • The benchmark contains 98 realistic generation and fix scenarios spanning 22 CWE types and five languages, enabling broad coverage of industrial vulnerability classes.
  • Evaluation is primarily dynamic: generated artefacts are compiled and executed in sandboxed containers and subjected to functional tests before security PoC tests, reducing false positives relative to static-only approaches.
  • LLM-as-a-judge is used for semantics-heavy vulnerabilities such as weak cryptography or hard-coded credentials, with an odd-sized panel and majority voting to produce verdicts where determinism is infeasible (a minimal voting sketch follows this list).
  • The framework enforces an iterative repair workflow: if a candidate fails functional checks, it retries up to r=3 times by returning error messages to the model, mimicking real-world debugging.
  • A Pass@1 baseline aggregated with severity and scenario weights provides both weighted and unweighted scores, per-scenario and per-language breakdowns, and detailed per-test-case traces to support diagnosis and model comparison.
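The voting rule in the third finding is simple enough to sketch. The judge callables below are stubs standing in for real LLM calls; only the odd-panel majority logic reflects what the report describes.

```python
from collections import Counter

# Minimal sketch of odd-panel majority voting for semantics-heavy checks.
# The judges are hypothetical callables, each assumed to return
# "secure" or "insecure" for a given code sample.

def panel_verdict(code: str, judges) -> str:
    assert len(judges) % 2 == 1, "use an odd-sized panel so a majority always exists"
    votes = Counter(judge(code) for judge in judges)
    verdict, _ = votes.most_common(1)[0]
    return verdict

# Example with stub judges standing in for real LLM calls.
stub_judges = [lambda c: "insecure", lambda c: "secure", lambda c: "insecure"]
print(panel_verdict('MessageDigest.getInstance("MD5")', stub_judges))  # insecure
```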

Limitations

Dynamic PoC tests provide strong evidence of exploitability under the benchmark’s threat model but do not guarantee absence of vulnerabilities beyond exercised inputs and paths. The authors mitigate this with atomic scenarios, multiple exploit strategies per case and expert review, but results should be interpreted as security evidence within the benchmark scope rather than formal proofs. Some assessments rely on LLM judgements, which add a semantic but non-deterministic signal.

Why It Matters

SecCodeBench-V2 supplies a reproducible, industrially grounded framework to assess and compare LLM-based coding assistants on usable and secure code generation across languages and severity levels. Its execution-driven validation, severity-aware scoring and per-case artefacts help practitioners and researchers audit model security posture, identify recurring failure modes, guide targeted mitigation efforts and evaluate improvements. The public release of the benchmark, tests and pipeline supports reproducible auditing of AI-assisted development tools.


