Fixing LLM Over-refusal Without Breaking Safety
Defenses
Large Language Model (LLM) safety systems are supposed to stop harm. In practice they often over-refuse: legitimate users get blocked, workflows fail, and people start trying to outsmart the filter. The paper under discussion analyses that failure mode from a representation perspective and offers a practical mitigation called MOSR. The core finding is simple and worrying. Many of the prompts that get falsely refused live on the boundary in representation space between benign and malicious inputs. That makes the boundary a natural probe target for attackers and a source of real friction for operators.
What MOSR is and why it matters
MOSR intervenes during safety alignment rather than changing application logic at runtime. It combines two moves. First, Overlap-Aware Loss Weighting reduces the erasure pressure on malicious samples whose representations look too similar to benign or borderline prompts. Second, Context-Aware Augmentation prepends harmful prefixes to rejection responses during training so the model learns a fuller context for why it should refuse. The authors work with representation probes, fine-tune using LoRA on mid-to-late layers of the tested models (Llama3 8B Instruct and Mistral 7B Instruct v0.2), and show reductions in over-refusal while largely preserving standard safety metrics.
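The overlap-aware idea can be pictured with a small sketch: measure how similar each malicious sample's hidden representation is to a reference built from over-refusal prompts, and shrink its erase weight as that similarity grows. This is a minimal sketch of the weighting idea, not the paper's implementation; the function name, the centroid reference, and the `floor` parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def overlap_aware_weights(malicious_reps: torch.Tensor,
                          over_refusal_reps: torch.Tensor,
                          floor: float = 0.1) -> torch.Tensor:
    """Down-weight erasure for malicious samples whose hidden representations
    resemble over-refusal prompts. `floor` is an illustrative lower bound so
    no sample's erase loss vanishes entirely."""
    # Mean representation of the over-refusal set acts as a reference point.
    centroid = over_refusal_reps.mean(dim=0, keepdim=True)
    # Cosine similarity in [-1, 1]; higher means closer to the over-refusal region.
    sim = F.cosine_similarity(malicious_reps, centroid, dim=-1)
    # The more a malicious sample overlaps with over-refusal prompts,
    # the less erasure pressure it receives.
    weights = 1.0 - sim.clamp(min=0.0)
    return weights.clamp(min=floor)

# The per-sample erase loss would then be multiplied by these weights before reduction.
```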
From an operational security perspective the representation-boundary observation is the headline risk. If safety decisions hinge on a narrow band in vector space, an attacker can craft prompts that sit just inside or outside that band. That affects model endpoints, vector stores, and any component that relies on representation distance or probes. It also affects user experience: legitimate queries become a testbed for adversarial probing when people repeatedly receive unexplained refusals.
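To see why a narrow band is probe-able, consider a minimal linear probe over pooled hidden states: the signed margin tells you how close a prompt sits to the decision boundary, and an attacker with query access can hunt for inputs that hover near zero. Everything below is an illustrative assumption (random stand-in data, a scikit-learn probe, a `boundary_margin` helper), not tooling from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for pooled hidden states extracted from a mid-to-late layer.
rng = np.random.default_rng(0)
benign_reps = rng.normal(0.0, 1.0, size=(200, 64))
malicious_reps = rng.normal(1.0, 1.0, size=(200, 64))

X = np.vstack([benign_reps, malicious_reps])
y = np.array([0] * len(benign_reps) + [1] * len(malicious_reps))

probe = LogisticRegression(max_iter=1000).fit(X, y)

def boundary_margin(rep: np.ndarray) -> float:
    """Signed distance of a prompt representation from the probe's decision
    boundary; values near zero mark the borderline band where over-refusals
    (and adversarial probing) concentrate."""
    return float(probe.decision_function(rep.reshape(1, -1))[0]) / float(np.linalg.norm(probe.coef_))
```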
Run-book mitigations for SREs and security teams
There is no single magic fix. MOSR is a training-time approach that reduces over-refusal at source, and it should be treated as one tool in the defence stack. Immediate operational steps you can take now include improving telemetry, hardening endpoints, and treating safety decisions as observable signals rather than opaque blocks. Instrument rejection events with the input representation vector or a stable hash, log preceding context, and flag repeated borderline refusals from specific clients or tokens. Use rate limits and canary deployments when rolling out any retrained safety model.
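A minimal telemetry sketch of that instrumentation, assuming you can obtain a pooled representation for each refused request; the thresholds, names, and print-based sink are placeholders for your own logging pipeline, not anything prescribed by the paper.

```python
import hashlib
import json
import time
from collections import defaultdict, deque

BORDERLINE_WINDOW = 600      # seconds; illustrative
BORDERLINE_THRESHOLD = 5     # refusals per window before flagging; illustrative

_recent_refusals = defaultdict(deque)

def rep_hash(representation) -> str:
    """Stable hash of a rounded representation vector so refusals can be
    correlated across requests without storing raw embeddings."""
    rounded = [round(float(x), 3) for x in representation]
    return hashlib.sha256(json.dumps(rounded).encode()).hexdigest()[:16]

def log_refusal(client_id: str, representation, context_summary: str) -> bool:
    """Record a refusal event and return True when this client has crossed the
    repeated-refusal threshold and should be throttled or reviewed."""
    now = time.time()
    event = {
        "ts": now,
        "client": client_id,
        "rep_hash": rep_hash(representation),
        "context": context_summary,  # preceding context, redacted upstream
    }
    print(json.dumps(event))  # stand-in for the real telemetry sink

    window = _recent_refusals[client_id]
    window.append(now)
    while window and now - window[0] > BORDERLINE_WINDOW:
        window.popleft()
    return len(window) >= BORDERLINE_THRESHOLD
```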
Be aware of the new attack surface introduced by Context-Aware Augmentation. If your system prepends harmful prefixes into rejection training, make sure the reject rationale does not leak those prefixes back to users or logs in a way that reveals policy heuristics. That could help attackers map the boundary instead of masking it.
- Audit recent rejection spikes and capture representation snapshots for investigation.
- Apply throttles and require secondary verification when users trigger repeated refusals.
- Plan a LoRA or fine-tune experiment in a staging environment with rollback metrics for both safety and over-refusal; a gate sketch follows this list.
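A rollback gate for that staging experiment might look like the following; the thresholds and helper names are illustrative assumptions, and the boolean outcomes are whatever your harness records for attack success and false refusals.

```python
# Hypothetical promotion gate comparing a retrained safety model against the
# current baseline on two axes: attack success rate (ASR) on a red-team set
# and false-refusal rate on a benign/borderline set. Thresholds are illustrative.

MAX_ASR_REGRESSION = 0.01    # allow at most +1pp attack success
MIN_OVERREFUSAL_GAIN = 0.05  # require at least -5pp false refusals

def rate(outcomes) -> float:
    """Fraction of True outcomes (successful attacks, or false refusals)."""
    outcomes = list(outcomes)
    return sum(outcomes) / max(len(outcomes), 1)

def promote_candidate(baseline_attack, candidate_attack,
                      baseline_benign, candidate_benign) -> bool:
    """Each argument is an iterable of booleans from the staging harness:
    attack sets mark whether a jailbreak succeeded, benign sets mark whether
    a legitimate prompt was refused."""
    asr_delta = rate(candidate_attack) - rate(baseline_attack)
    refusal_gain = rate(baseline_benign) - rate(candidate_benign)
    return asr_delta <= MAX_ASR_REGRESSION and refusal_gain >= MIN_OVERREFUSAL_GAIN
```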
MOSR moves the dial on a real problem: filters that are too cautious. It is not a final answer. The approach depends on careful hyperparameter choices and focuses on single-turn jailbreak settings. Expect some tuning, monitoring, and a test harness that measures both attack success rates and false refusals. For teams running LLMs in production, that balance between safety and usability is worth the engineering effort; leaving it unaddressed is how you get both angry users and a predictable probe surface for attackers.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
🔍 ShortSpan Analysis of the Paper
Problem
Large language model safety filters can over-refuse benign prompts, harming usability. The paper analyses this over-refusal from a representation perspective and finds that over-refusal samples sit at the boundary between benign and malicious representations, making it easy for attackers to probe and exploit safety boundaries. The goal is to reduce false refusals while preserving safety, improving the practical usability of aligned LLMs.
Approach
The authors propose MOSR, a mitigation framework that intervenes during safety alignment in the representation space. It has two novel components: Overlap-Aware Loss Weighting, which assigns lower erasure weights to malicious samples that resemble over-refusal prompts, measured by their cosine similarity to a representation of over-refusal samples; and Context-Aware Augmentation, which extends the context for rejection decisions by prepending harmful prefixes before the rejection response. They train with representation-based probes, using three data sources and two LLMs, fine-tuning with LoRA on mid-to-late layers, and evaluate safety and over-refusal with diverse benchmarks. Algorithm 1 formalises the training: an erase loss reduces harmful representations, a retain loss keeps safety alignment consistent with the original model, and Context-Aware Augmentation extends the context for refusals. Datasets include WildGuardMix, WildJailbreak and UltraChat for general capabilities, plus an over-refusal set from OR-Bench. Models tested are Llama3 8B Instruct and Mistral 7B Instruct v0.2.
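Algorithm 1 itself is not reproduced here, but one plausible reading of the erase-plus-retain objective, with the overlap-aware weights applied per sample, is sketched below. The function name, the cosine and MSE choices, and the `alpha` coefficient are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mosr_style_loss(model_reps, frozen_reps,
                    retain_reps_model, retain_reps_frozen,
                    erase_weights, alpha: float = 1.0):
    """Interpretive sketch of an erase-plus-retain objective.

    model_reps / frozen_reps: hidden states for malicious prompts from the
    fine-tuned and original (frozen) model respectively.
    retain_reps_*: the same pair for benign/retain prompts.
    erase_weights: per-sample weights from the overlap-aware scheme, so
    boundary-adjacent malicious samples receive less erasure pressure.
    alpha: illustrative trade-off coefficient.
    """
    # Erase: push the fine-tuned model's harmful representations away from
    # the frozen model's, weighted per sample.
    sim_harm = F.cosine_similarity(model_reps, frozen_reps, dim=-1)
    erase_loss = (erase_weights * sim_harm).mean()

    # Retain: keep benign representations close to the original model so
    # general capability and existing safety alignment are preserved.
    retain_loss = F.mse_loss(retain_reps_model, retain_reps_frozen)

    return erase_loss + alpha * retain_loss
```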
Key Findings
- Over-refusal arises at the representation boundary where benign and malicious prompts overlap; LLMs can distinguish safe from unsafe in ordinary prompts, but over-refusal samples sit misaligned in representation space.
- MOSR with Overlap-Aware Loss Weighting and Context-Aware Augmentation reduces over-refusal while largely preserving safety across models and benchmarks.
- Ablation shows both components contribute; combining both yields the best balance with a competitive safety profile and acceptance of benign prompts.
- Across models such as Llama3 8B Instruct and Mistral 7B Instruct, the method improves the balance between safety and usability; safety is evaluated with standard attack-based ASR scores, over-refusal with dedicated benchmarks, and general capabilities remain comparable to the base models.
Limitations
Balancing safety and over-refusal remains challenging because pseudo-harmful prompts lie near the boundary with benign prompts, causing overlap that is difficult to erase without harming usefulness. Context-Aware Augmentation can slightly reduce safety in some cases, and hyperparameters such as context length and temperature influence results. The study focuses on single-turn jailbreak settings; multi-turn over-refusal scenarios are left for future work. Training costs and reliance on LoRA are further considerations.
Why It Matters
Practically, the approach reduces user frustration and discourages attempts to bypass safety, supporting safer deployment in sensitive sectors. Security implications include recognising the boundary risk in safety decisions and an attack surface in the Context-Aware Augmentation step that could be misused to alter rejection rationales or content. Societal impact centres on enhancing trust and safety usability rather than addressing broader issues such as autonomous weapons or mass surveillance.