Paper maps multiple refusal mechanisms in LLMs
Pentesting
Large Language Model (LLM) safety often reduces to a single rule: if a model recognises a harmful request it should refuse. That tidy view underpins many defences, but recent research complicates it. The paper under discussion presents a gradient-based approach to map internal directions and shows that refusals are not controlled by one lone axis. Instead, multiple independent directions and multi-dimensional 'concept cones' jointly determine when a model will refuse.
What the authors did
The team develops two optimisation procedures, Refusal Direction Optimisation and Refusal Cone Optimisation, to find vectors in activation space that reliably trigger refusal behaviour while minimising side effects on benign tasks. They evaluate across open models including Gemma 2, Qwen 2.5 and Llama 3, and use a balanced dataset of harmful and harmless prompts drawn from Alpaca and the SALAD benchmark. Crucially, their method includes a retain loss to preserve normal, non-harmful behaviour when interventions are applied.
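The flavour of such an objective can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions, not the paper's implementation: a fixed linear probe `w` stands in for the model's internal refusal signal (the real method backpropagates through the frozen LLM), and random vectors stand in for residual-stream activations.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Toy stand-ins (assumptions, not the paper's data): a fixed linear probe w
# plays the role of the model's internal refusal signal, and random vectors
# play the role of activations for harmful and harmless prompts.
w = rng.normal(size=dim); w /= np.linalg.norm(w)
harmful  = rng.normal(size=(32, dim)) + 0.5 * w   # weak refusal component
harmless = rng.normal(size=(32, dim))

d = rng.normal(size=dim); d /= np.linalg.norm(d)
lam, lr = 0.05, 0.05

for _ in range(300):
    # Refusal term: raise the surrogate refusal score of h + d. With the
    # linear surrogate mean(w . (h + d)), the gradient is simply -w; the
    # real method would backpropagate through the model instead.
    grad_refusal = -w
    # Retain term: penalise harmless activations' component along d, so
    # intervening on d disturbs benign behaviour as little as possible.
    proj = harmless @ d
    grad_retain = 2 * (proj[:, None] * harmless).mean(axis=0)
    d -= lr * (grad_refusal + lam * grad_retain)
    d /= np.linalg.norm(d)   # optimise over unit directions only

# Steering harmful activations along the optimised direction should
# raise the surrogate refusal score.
score = lambda acts: float((acts @ w).mean())
before, after = score(harmful), score(harmful + 2.0 * d)
```

The retain term here is the sketch's analogue of the paper's retain loss: it discourages directions that carry large components of benign activations, so ablating or steering along `d` leaves harmless behaviour largely intact.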
Rather than finding a single, universal refusal vector, the researchers identify several distinct directions that each increase refusal likelihood. These directions organise into low-dimensional concept cones, reaching up to five dimensions in larger models. The team also formalises representational independence: two directions are independent if ablating one does not change how strongly the other appears across layers, a condition stricter than mere orthogonality.
They test independence both linearly and under interventions and show that orthogonality alone does not guarantee independent mechanisms. In practice, combining representationally independent directions produces additive effects on jailbreaking success, which suggests the vectors capture different causal pathways to refusal. The paper also demonstrates that adversarial input suffixes can sometimes access these directions, enabling jailbreaks in a substantial minority of trials.
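The gap between orthogonality and representational independence can be made concrete with a toy numpy sketch. The nonlinear layer below is an assumed construction for illustration, not the paper's models: two exactly orthogonal directions whose expressions nonetheless interact downstream, so they fail the cross-layer independence test.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16
d1 = np.zeros(dim); d1[0] = 1.0   # two exactly orthogonal unit directions
d2 = np.zeros(dim); d2[1] = 1.0

H = rng.normal(size=(128, dim))   # toy layer-l activations

def next_layer(h):
    # Assumed toy block: writes direction d2 with strength relu(h . d1),
    # a deliberately nonlinear coupling between the two directions.
    return h + np.maximum(h @ d1, 0)[:, None] * d2

def expression(acts, d):
    # How strongly direction d shows up: mean absolute projection.
    return float(np.abs(acts @ d).mean())

def ablate(acts, d):
    # Remove the component of every activation along unit direction d.
    return acts - np.outer(acts @ d, d)

# Same-layer check: ablating d1 leaves d2's expression untouched,
# exactly as orthogonality guarantees.
same_before = expression(H, d2)
same_after  = expression(ablate(H, d1), d2)

# Cross-layer check: ablating d1 changes how strongly d2 is expressed
# after the nonlinear layer, so the directions are orthogonal but NOT
# representationally independent.
down_before = expression(next_layer(H), d2)
down_after  = expression(next_layer(ablate(H, d1)), d2)
```

This is why the paper's criterion is stricter than linear decorrelation: independence must survive interventions measured across layers, not just dot products at a single layer.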
Compared with earlier methods, the gradient-based directions often achieve competitive or better jailbreaking performance while producing fewer harmful side effects on benign benchmarks such as TruthfulQA. That trade-off matters: a defensive control that breaks useful capability is of limited value to practitioners.
Why this matters for pentesting and defenders
For security teams the takeaway is practical. Treat model refusal as a structured, high-dimensional boundary, not a single gate. Red-team exercises and automated testing should probe multiple independent directions, and audits need to monitor representations across layers and tokens. Because orthogonality does not equal independence, simple checks that rely on linear decorrelation risk missing nonlinear interactions that enable bypasses.
This work also gives defenders a concrete toolset. Gradient-based representation engineering offers a way to enumerate diverse failure modes, build more exhaustive red-team suites, and design targeted interventions that restore safety with lower impact on capability. The caveats are important: the experiments use open white-box models and probe cones up to five dimensions, so deployed proprietary systems may differ and other independent mechanisms could exist.
In short, the geometry of refusal matters. Thinking in cones and representational independence moves testing from anecdote to systematic coverage. That shift is essential if governance, audits and incident response are to keep pace with how models actually represent safety-critical concepts.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
🔍 ShortSpan Analysis of the Paper
Problem
This paper investigates how refusals emerge in large language models and how the safety boundary that governs refusals can be understood and tested. It challenges the view that a single refusal direction in the model’s activation space controls whether a model refuses harmful requests, and shows that refusals arise from multiple independent directions within complex spatial structures. The work aims to inform safer design, rigorous safety testing, and risk assessment for AI systems.
Approach
The authors introduce a gradient-based representation engineering framework to identify refusal directions, define representational independence, and reveal multi-dimensional concept cones that mediate refusal. They develop Refusal Direction Optimisation and Refusal Cone Optimisation to discover directions and cones that satisfy monotonic scaling and surgical ablation properties while minimising side effects on harmless inputs. A dataset of harmful and harmless prompts is constructed from Alpaca and the SALAD benchmark, with 1184 training and 128 validation harmful prompts, plus equal numbers of harmless prompts and 128 harmless test prompts. Evaluations span model families including Gemma 2, Qwen 2.5 and Llama 3. The method can operate at flexible layers and tokens, and uses a retain loss to preserve benign behaviour during interventions. The work also extends to higher-dimensional cones and tests for independence across directions, covering both orthogonality and nonlinear interactions.
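The two intervention properties named above can be sketched directly. The following is a minimal numpy illustration of the standard operations (the function names are mine, not the paper's API): directional ablation surgically removes a direction's component from activations, while activation addition steers along it with a scalable strength.

```python
import numpy as np

def directional_ablation(acts, d):
    """Surgical ablation: remove each activation's component along
    direction d, suppressing whatever mechanism d mediates."""
    d_hat = d / np.linalg.norm(d)
    return acts - np.outer(acts @ d_hat, d_hat)

def activation_addition(acts, d, alpha):
    """Monotonic scaling: steer activations along d with strength alpha;
    a refusal direction should raise refusal likelihood as alpha grows."""
    d_hat = d / np.linalg.norm(d)
    return acts + alpha * d_hat

# Toy usage on random stand-in activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 64))
d = rng.normal(size=64)
ablated = directional_ablation(acts, d)
steered = activation_addition(acts, d, alpha=4.0)
```

After ablation the activations have exactly zero projection onto `d`; after addition every projection is shifted by `alpha`, which is what makes the scaling property testable.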
Key Findings
- Refusal behaviour in LLMs is mediated by multiple independent directions rather than a single linear axis, organised into multi-dimensional cones in which all directions contribute to refusal.
- Representational independence is defined to capture both linear and nonlinear interactions; two directions are representationally independent (RepInd) if ablating one does not change how strongly the other is expressed in activations across layers.
- Gradient-based methods identify direction sets that achieve refusals with greater precision and fewer unintended side effects than prior approaches, and perform well across several model families.
- Refusal cones exist with dimensions up to five in tested models; larger models tend to support higher dimensional cones, while cone performance degrades with higher dimension in smaller models.
- Combinations of RepInd directions yield additive improvements in attack success, indicating they reflect distinct mechanisms rather than simply different manifestations of a single process.
- Access from input to RepInd directions is possible in part through adversarial suffixes: suffixes found with GCG (Greedy Coordinate Gradient) can reduce RepInd representations in some cases and enable jailbreaks in a substantial minority of trials (around one third in the reported results).
- RDO directions achieve competitive or superior jailbreaking performance compared to the DIM baseline, especially when using activation addition or directional ablation, and typically impose fewer negative impacts on benign benchmarks such as TruthfulQA.
- Across multiple benchmarks including GSM8K, MMLU, ARC-Challenge and TruthfulQA, RDO directions show lower side effects and often Pareto dominance over the DIM baseline, indicating safer control of refusals with preserved capability.
- Orthogonality alone does not guarantee independence; ablation experiments reveal cross layer and nonlinear interactions between directions, motivating the stronger RepInd criterion.
- The concept cone framework offers a practical way to visualise and monitor safety boundaries, and the gradient-based approach can be extended to other safety-relevant concepts beyond refusal.
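One way to picture a k-dimensional concept cone is as the set of nonnegative combinations of k basis directions, every one of which should trigger refusal. The sketch below samples unit directions from such a cone; the orthonormal basis is an assumption for illustration (real cone bases need not be orthogonal).

```python
import numpy as np

rng = np.random.default_rng(1)
dim, k = 64, 2

# Assumed toy basis: k orthonormal rows spanning the cone.
Q, _ = np.linalg.qr(rng.normal(size=(dim, k)))
B = Q.T                                    # shape (k, dim)

def sample_cone_direction(B, rng):
    """Draw a unit vector from the cone: a random nonnegative
    combination of the basis directions, renormalised."""
    coeffs = rng.random(B.shape[0])        # nonnegative weights
    v = coeffs @ B
    return v / np.linalg.norm(v)

v = sample_cone_direction(B, rng)
```

Sampling like this is what turns red-teaming from testing one fixed vector into probing a whole region of activation space: each draw is a distinct candidate refusal direction inside the cone.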
Limitations
The refusal directions are optimised on the same set of targets, which may overlook additional independent mechanisms. The study uses open source white box models, and while informative for defensive design, results may not fully transfer to deployed proprietary systems. Higher dimensional cones are demonstrated up to dimension five, but the existence of other undiscovered regions remains possible. Adversarial suffix findings indicate potential input based pathways to RepInd directions, underscoring remaining security considerations.
Why It Matters
The findings highlight that safety boundaries in LLMs are structured in high-dimensional spaces and governed by multiple interacting mechanisms. This has practical implications for safety testing and red-team evaluations, as probing multiple independent directions improves coverage of potential failure modes. Representational cones provide a tangible visualisation for auditing robustness, while representational independence informs defence design by emphasising distinct mechanisms to monitor and guard. The gradient-based framework is adaptable to other safety-critical concepts, potentially guiding ongoing enhancements to alignment, monitoring and model governance.