
GoodVibe hardens LLM code generation via neuron tuning

Defenses
Published: Sat, Feb 14, 2026 • By Natalie Kestrel
GoodVibe is a neuron-level method that improves code security in Large Language Models (LLMs) without retraining entire models. It finds a small set of security-relevant neurons with gradient attribution, clusters them, and fine-tunes only those clusters. The paper reports large gains in safe-response rates for C++, Java, Swift and Go while sharply cutting compute and trainable parameters.

Code-generation models are increasingly part of everyday development, and many teams lean on fast, informal workflows where security is an afterthought. The paper GoodVibe targets that gap. It argues that security reasoning inside a Large Language Model (LLM) concentrates in a compact set of neurons, and that identifying and tuning just those neurons can materially improve the security of generated code without wholesale retraining.

How it works

GoodVibe proceeds in two stages. First, it reframes security assessment as a supervised task and uses gradient-based attribution to score each neuron for its influence on security decisions. The top-k neurons per layer form a security-critical subspace. Second, the method freezes the rest of the model and fine-tunes only that subspace. To keep training practical and stable, the authors cluster neurons by activation similarity and learn cluster-level updates rather than independent per-neuron weights.
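
To make the selection step concrete, here is a minimal sketch of how gradient-based attribution and activation-driven clustering might look in PyTorch. It is not the authors' code: it assumes the "neurons" are hidden units of each transformer MLP layer whose activations were captured (for example with forward hooks) on a labelled security batch, uses a simple gradient-times-activation score as one plausible attribution, and fixes the cluster count as an illustrative stand-in for the paper's data-driven choice; `top_k=50` mirrors the setting reported in the paper's ablations.

```python
import torch
from sklearn.cluster import KMeans

def score_neurons(loss, layer_acts):
    """Gradient-times-activation attribution: estimate how much each MLP
    hidden unit contributed to the supervised security loss."""
    grads = torch.autograd.grad(loss, layer_acts, retain_graph=True)
    scores = []
    for act, grad in zip(layer_acts, grads):
        # Aggregate |activation * gradient| over batch and sequence positions,
        # leaving one importance score per hidden unit in the layer.
        scores.append((act * grad).abs().sum(dim=(0, 1)))
    return scores  # one [hidden_dim] tensor per layer

def select_and_cluster(scores, layer_acts, top_k=50, n_clusters=8):
    """Keep the top-k neurons per layer, then group them by activation
    similarity so fine-tuning can learn one update per cluster."""
    subspace = {}
    for layer_idx, (layer_scores, act) in enumerate(zip(scores, layer_acts)):
        top_ids = torch.topk(layer_scores, k=top_k).indices
        # Describe each selected neuron by its activation pattern over the
        # probe batch, then cluster those patterns with k-means.
        feats = act.detach()[..., top_ids].flatten(0, 1).T.float().cpu().numpy()
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
        subspace[layer_idx] = dict(zip(top_ids.tolist(), labels.tolist()))
    return subspace  # layer index -> {neuron index: cluster id}
```

The output is a per-layer map from selected neurons to cluster ids, which is all the second stage needs to know which weights to touch and which neurons share an update.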

The experiments are broad for a paper of this type. GoodVibe is applied to six LLMs and evaluated across C++, Java, Swift and Go. Results are striking on the headline metrics: reported safe-response rates reach as high as 87.5% on C++ and 76.0% on Java. By contrast, some base models start from single-digit safety rates on the same benchmarks. The paper also emphasises efficiency: trainable parameters stay under 3 million per model (about 1.9 million on average), the reported compute for fine-tuning a 7B model is lower than both full fine-tuning and a LoRA baseline, and the authors claim improvements that match or exceed full fine-tuning while using orders of magnitude fewer trainable parameters.

What is solid and what to question

The technical premise is plausible and the ablations strengthen it. Gradient attribution outperforms activation-only selection in their tests, and clustering reduces instability while keeping parameter counts low. The modest impact on general benchmarks (an average drop of 0.84% across GSM8K, ARC and MMLU) supports the claim that utility is largely preserved.

That said, the paper leaves several practical questions open. The security labels and measurements rely on an automated judge model; the authors validate it, but automated judges miss context subtleties and real-world exploitability. The method explicitly targets benign use cases and does not address adversarial prompting, jailbreaks or active attempts to bypass a tuned subspace. The work also assumes a transferable security subspace and stable gradient signals across models and deployments; those assumptions may not hold for all architectures, tokenisers or future model updates.

There are resilience questions the paper flags but does not resolve. If security-critical reasoning is localised, an attacker with knowledge of the approach might try to manipulate activations or craft prompts that shift model behaviour out of the tuned subspace. The paper notes these risks and frames them as directions for future work rather than problems it solves today.

For security teams and product owners the takeaway is pragmatic. GoodVibe shows a focused, auditable route to reduce insecure code generation without full retraining, and it does so with manageable compute. But it is not a silver bullet. Organisations should treat this as a layer in a defence-in-depth strategy, validate results with human review and real-world attack scenarios, and watch for subspace drift if models or toolchains change.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

GoodVibe: Security-by-Vibe for LLM-Based Code Generation

Authors: Maximilian Thang, Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami, Jona te Lintelo, Stjepan Picek, and Ahmad-Reza Sadeghi
Large language models (LLMs) are increasingly used for code generation in fast, informal development workflows, often referred to as vibe coding, where speed and convenience are prioritized, and security requirements are rarely made explicit. In this setting, models frequently produce functionally correct but insecure code, creating a growing security risk. Existing approaches to improving code security rely on full-parameter fine-tuning or parameter-efficient adaptations, which are either costly and prone to catastrophic forgetting or operate at coarse granularity with limited interpretability and control. We present GoodVibe, a neuron-level framework for improving the security of code language models by default. GoodVibe is based on the key insight that security-relevant reasoning is localized to a small subset of neurons. We identify these neurons using gradient-based attribution from a supervised security task and perform neuron-selective fine-tuning that updates only this security-critical subspace. To further reduce training cost, we introduce activation-driven neuron clustering, enabling structured updates with minimal overhead. We evaluate GoodVibe on six LLMs across security-critical programming languages, including C++, Java, Swift, and Go. GoodVibe substantially improves the security of generated code while preserving general model utility, achieving up to a 2.5x improvement over base models, matching or exceeding full fine-tuning with over 4,700x fewer trainable parameters, and reducing training computation by more than 3.6x compared to the parameter-efficient baseline (LoRA). Our results demonstrate that neuron-level optimization offers an effective and scalable approach to securing code generation without sacrificing efficiency or generality.

🔍 ShortSpan Analysis of the Paper

Problem

This work studies the security of code generation by large language models used in vibe coding, where developers prioritise speed and informal workflows and explicit security requirements are often absent. Although models frequently produce functionally correct code, it can be insecure. Existing approaches rely on full parameter fine tuning or parameter efficient adaptations, which are costly, fragile or operate at coarse granularity with limited interpretability. The paper introduces GoodVibe, a neuron level framework that aims to harden code generation by default. The central premise is that security related reasoning is localised to a small subset of neurons, which can be identified with gradient based attribution from a supervised security task and then selectively fine tuned. An activation driven clustering mechanism reduces training cost by enabling structured updates. GoodVibe is evaluated on six LLMs across security critical languages including C++, Java, Swift and Go, and claims substantial security improvements while preserving general model utility, with up to 2.5x improvement over base models, matching or exceeding full fine tuning with over 4700x fewer trainable parameters and reducing training computation by more than 3.6x compared with LoRA. The work argues that neuron level optimisation offers an effective and scalable approach to secure code generation without sacrificing efficiency or generality, and discusses potential resilience concerns around subspace manipulation and attacker strategies.

Approach

GoodVibe proceeds in two stages. First, security neurons are identified by recasting security assessment as a supervised task and using gradient based attribution to measure each neuron's influence on security related decisions. For each transformer layer, the top k neurons with the highest importance scores are selected to define a security critical subspace. Second, fine tuning is restricted to this subspace while all other parameters are frozen. To improve efficiency and stability, security neurons are clustered by similarity using k-means, and cluster level updates are learned rather than independent neuron updates. This yields updates that scale with cluster count rather than neuron count. The process uses two epochs of training with AdamW, a learning rate of 1e-4, a cosine schedule and a 0.1 warmup ratio, with mixed precision and gradient checkpointing to reduce memory use. Neuron identification and the subsequent fine tuning are performed on controlled data derived from a security evaluation dataset, and the final updates are folded back into standard weights for inference. Evaluation examines security effectiveness via a separate automated judge, efficiency in terms of trainable parameters and FLOPs, and utility preservation on standard reasoning and language benchmarks.
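
As a rough illustration of the second stage, the sketch below freezes a LLaMA-style Hugging Face model and fine-tunes only the down-projection columns that correspond to the selected neurons, tying gradients within each cluster so its members receive a shared update. It is a simplified reading of the paper, not the authors' implementation: the module path `model.model.layers[i].mlp.down_proj`, the batch format, and realising cluster-level updates by averaging gradients within a cluster are assumptions; the optimiser settings (AdamW, learning rate 1e-4, cosine schedule, 10% warmup, two epochs) follow the values reported above, while mixed precision and gradient checkpointing are omitted for brevity.

```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

def finetune_security_subspace(model, loader, subspace, epochs=2, lr=1e-4):
    """Fine-tune only the down-projection columns of the selected neurons,
    tying gradients inside each cluster so its members share one update."""
    for p in model.parameters():        # freeze the whole model first
        p.requires_grad = False
    params, masks = [], {}
    for layer_idx, clusters in subspace.items():
        w = model.model.layers[layer_idx].mlp.down_proj.weight  # assumed LLaMA-style layout
        w.requires_grad = True
        params.append(w)
        m = torch.zeros_like(w)
        m[:, list(clusters)] = 1.0      # columns of down_proj correspond to MLP hidden units
        masks[layer_idx] = m

    opt = AdamW(params, lr=lr, weight_decay=0.0)  # no decay: untouched columns must stay frozen
    total_steps = epochs * len(loader)
    sched = get_cosine_schedule_with_warmup(opt, int(0.1 * total_steps), total_steps)

    for _ in range(epochs):
        for batch in loader:            # assumed: tokenised secure-coding examples with labels
            loss = model(**batch).loss
            loss.backward()
            for w, (layer_idx, clusters) in zip(params, subspace.items()):
                g = w.grad
                g.mul_(masks[layer_idx])                   # zero gradients outside the subspace
                for cid in set(clusters.values()):         # average gradients within each cluster
                    cols = [n for n, c in clusters.items() if c == cid]
                    g[:, cols] = g[:, cols].mean(dim=1, keepdim=True)
            opt.step(); sched.step(); opt.zero_grad()
    return model  # tuned values already live in the base weights, so inference is unchanged
```

Because the updates are written directly into the existing weight matrices, there is no adapter to merge and no extra inference cost, which matches the paper's point that the final updates are folded back into standard weights.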

Key Findings

  • Security effectiveness: across six models and multiple languages, GoodVibe achieves up to 87.5 per cent safe responses on C++ and 76.0 per cent on Java, compared with pretrained baselines that are often weak (for example 6.1 per cent on C++ for CodeLlama-7B and 12.0 per cent for Meta-Llama-3-8B, with some models around 29 per cent on C++ before adaptation).
  • Generalisation to languages: GoodVibe generalises beyond C++ and Java, showing improved security across Swift and Go as well, with Swift and Go averages of 53.6 per cent and 54.0 per cent respectively, outperforming LoRA which yields 51.9 per cent and 48.9 per cent in the same study.
  • Efficiency in trainable parameters: GoodVibe requires fewer than 3 million trainable parameters per model, with an average of 1.9 million, which is far smaller than full fine tuning that updates billions of parameters and far smaller than LoRA’s average of 6.2 million.
  • Computational cost: GoodVibe uses about 2.4 PFLOPs for fine tuning on CodeLlama-7B-Instruct-hf, compared with 4.5 PFLOPs for full fine tuning and 8.6 PFLOPs for LoRA, representing a reduction of around 46 per cent versus full fine tuning and over 70 per cent versus LoRA.
  • Security performance versus full fine tuning and LoRA: GoodVibe matches or exceeds full fine tuning in many cases while using orders of magnitude fewer trainable parameters and substantially lower compute, indicating an effective efficiency advantage without sacrificing security gains.
  • Utility preservation: GoodVibe maintains general reasoning and language capabilities with only a small average drop of 0.84 per cent on the GSM8K, ARC and MMLU benchmarks, indicating limited impact on non-security tasks across languages and models.
  • Ablation evidence: gradient based neuron identification outperforms activation based approaches, and clustering improves training stability and efficiency by sharing updates among similar neurons; removing clustering or using per neuron updates increases parameter count and can reduce robustness.
  • Hyperparameter robustness: moderate clustering with a silhouette threshold near 0.05, a top k of around 50 security neurons per layer and two training epochs yields strong security improvement; more aggressive clustering or more epochs can degrade performance or increase parameters.
  • Interpretability and resilience questions: the approach relies on gradient signals to identify security neurons and raises questions about potential bypass or manipulation of the security subspace by adversaries or subspace shifts, suggesting areas for further resilience research.

Limitations

The evaluation relies on an automated judge model to assess security, which, while validated for reliability, may not capture all semantic or contextual security aspects. The method targets benign usage scenarios and does not address active attacks such as adversarial prompting or jailbreaks. Generalisation is demonstrated across several languages, but the breadth of programming languages and real world projects remains finite. The approach assumes the existence of a transferable security subspace and gradient based signals that may not hold for all model architectures or deployments. There is potential for overfitting to the security supervision data, particularly with aggressive clustering or excessive epochs, and the authors note the need for careful hyperparameter selection.

Why It Matters

GoodVibe offers a practical path to secure by default code generation without large scale retraining, enabling safer AI assisted coding in real world workflows. By increasing the likelihood of secure patterns emerging during generation, the method supports auditable and deployable practices in sensitive environments while preserving general coding ability. The work also highlights resilience questions around manipulating the security subspace and regards these as important directions for future research, including extending the approach to other dimensions of controllable model behaviour such as privacy and compliance and integrating with deployment time safeguards. Ethical considerations and stakeholder impacts are discussed, emphasising benefits for developers, users, and security engineers while acknowledging potential harms if the technique is misused or over relied upon without continued human review.

