Fixing LLM Over-refusal Without Breaking Safety
Defenses
Large Language Model (LLM) safety systems are supposed to stop harm. In practice they often over-refuse: legitimate users get blocked, workflows fail, and people start trying to outsmart the filter. The paper under discussion analyses that failure mode from a representation perspective and offers a practical mitigation called MOSR. The core finding is simple and worrying. Many of the prompts that get falsely refused live on the boundary in representation space between benign and malicious inputs. That makes the boundary a natural probe target for attackers and a source of real friction for operators.
What MOSR is and why it matters
MOSR intervenes during safety alignment rather than changing application logic at runtime. It combines two moves. First, Overlap-Aware Loss Weighting reduces the erasure pressure on malicious samples whose representations look too similar to benign or borderline prompts. Second, Context-Aware Augmentation prepends harmful prefixes to rejection responses during training so the model learns a fuller context for why it should refuse. The authors work with representation probes, fine-tune using LoRA on mid-to-late layers of the tested models (Llama3 8B Instruct and Mistral 7B Instruct v0.2), and show reductions in over-refusal while largely preserving standard safety metrics.
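The overlap-aware idea can be pictured with a small sketch: measure how similar each malicious sample's hidden representation is to a reference built from over-refusal prompts, and shrink its erase weight as that similarity grows. This is a minimal sketch of the weighting idea, not the paper's implementation; the function name, the centroid reference, and the `floor` parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def overlap_aware_weights(malicious_reps: torch.Tensor,
                          over_refusal_reps: torch.Tensor,
                          floor: float = 0.1) -> torch.Tensor:
    """Down-weight erasure for malicious samples whose hidden representations
    resemble over-refusal prompts. `floor` is an illustrative lower bound so
    no sample's erase loss vanishes entirely."""
    # Mean representation of the over-refusal set acts as a reference point.
    centroid = over_refusal_reps.mean(dim=0, keepdim=True)
    # Cosine similarity in [-1, 1]; higher means closer to the over-refusal region.
    sim = F.cosine_similarity(malicious_reps, centroid, dim=-1)
    # The more a malicious sample overlaps with over-refusal prompts,
    # the less erasure pressure it receives.
    weights = 1.0 - sim.clamp(min=0.0)
    return weights.clamp(min=floor)

# The per-sample erase loss would then be multiplied by these weights before reduction.
```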
From an operational security perspective the representation-boundary observation is the headline risk. If safety decisions hinge on a narrow band in vector space, an attacker can craft prompts that sit just inside or outside that band. That affects model endpoints, vector stores, and any component that relies on representation distance or probes. It also affects user experience: legitimate queries become a testbed for adversarial probing when people repeatedly receive unexplained refusals.
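To see why a narrow band is probe-able, consider a minimal linear probe over pooled hidden states: the signed margin tells you how close a prompt sits to the decision boundary, and an attacker with query access can hunt for inputs that hover near zero. Everything below is an illustrative assumption (random stand-in data, a scikit-learn probe, a `boundary_margin` helper), not tooling from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for pooled hidden states extracted from a mid-to-late layer.
rng = np.random.default_rng(0)
benign_reps = rng.normal(0.0, 1.0, size=(200, 64))
malicious_reps = rng.normal(1.0, 1.0, size=(200, 64))

X = np.vstack([benign_reps, malicious_reps])
y = np.array([0] * len(benign_reps) + [1] * len(malicious_reps))

probe = LogisticRegression(max_iter=1000).fit(X, y)

def boundary_margin(rep: np.ndarray) -> float:
    """Signed distance of a prompt representation from the probe's decision
    boundary; values near zero mark the borderline band where over-refusals
    (and adversarial probing) concentrate."""
    return float(probe.decision_function(rep.reshape(1, -1))[0]) / float(np.linalg.norm(probe.coef_))
```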
Run-book mitigations for SREs and security teams
There is no single magic fix. MOSR is a training-time approach that reduces over-refusal at source, and it should be treated as one tool in the defence stack. Immediate operational steps you can take now include improving telemetry, hardening endpoints, and treating safety decisions as observable signals rather than opaque blocks. Instrument rejection events with the input representation vector or a stable hash, log preceding context, and flag repeated borderline refusals from specific clients or tokens. Use rate limits and canary deployments when rolling out any retrained safety model.
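A minimal telemetry sketch of that instrumentation, assuming you can obtain a pooled representation for each refused request; the thresholds, names, and print-based sink are placeholders for your own logging pipeline, not anything prescribed by the paper.

```python
import hashlib
import json
import time
from collections import defaultdict, deque

BORDERLINE_WINDOW = 600      # seconds; illustrative
BORDERLINE_THRESHOLD = 5     # refusals per window before flagging; illustrative

_recent_refusals = defaultdict(deque)

def rep_hash(representation) -> str:
    """Stable hash of a rounded representation vector so refusals can be
    correlated across requests without storing raw embeddings."""
    rounded = [round(float(x), 3) for x in representation]
    return hashlib.sha256(json.dumps(rounded).encode()).hexdigest()[:16]

def log_refusal(client_id: str, representation, context_summary: str) -> bool:
    """Record a refusal event and return True when this client has crossed the
    repeated-refusal threshold and should be throttled or reviewed."""
    now = time.time()
    event = {
        "ts": now,
        "client": client_id,
        "rep_hash": rep_hash(representation),
        "context": context_summary,  # preceding context, redacted upstream
    }
    print(json.dumps(event))  # stand-in for the real telemetry sink

    window = _recent_refusals[client_id]
    window.append(now)
    while window and now - window[0] > BORDERLINE_WINDOW:
        window.popleft()
    return len(window) >= BORDERLINE_THRESHOLD
```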
Be aware of the new attack surface introduced by Context-Aware Augmentation. If your system prepends harmful prefixes into rejection training, make sure the reject rationale does not leak those prefixes back to users or logs in a way that reveals policy heuristics. That could help attackers map the boundary instead of masking it.
- Audit recent rejection spikes and capture representation snapshots for investigation.
- Apply throttles and require secondary verification when users trigger repeated refusals.
- Plan a LoRA or fine-tune experiment in a staging environment with rollback metrics for both safety and over-refusal; a gate sketch follows this list.
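A rollback gate for that staging experiment might look like the following; the thresholds and helper names are illustrative assumptions, and the boolean outcomes are whatever your harness records for attack success and false refusals.

```python
# Hypothetical promotion gate comparing a retrained safety model against the
# current baseline on two axes: attack success rate (ASR) on a red-team set
# and false-refusal rate on a benign/borderline set. Thresholds are illustrative.

MAX_ASR_REGRESSION = 0.01    # allow at most +1pp attack success
MIN_OVERREFUSAL_GAIN = 0.05  # require at least -5pp false refusals

def rate(outcomes) -> float:
    """Fraction of True outcomes (successful attacks, or false refusals)."""
    outcomes = list(outcomes)
    return sum(outcomes) / max(len(outcomes), 1)

def promote_candidate(baseline_attack, candidate_attack,
                      baseline_benign, candidate_benign) -> bool:
    """Each argument is an iterable of booleans from the staging harness:
    attack sets mark whether a jailbreak succeeded, benign sets mark whether
    a legitimate prompt was refused."""
    asr_delta = rate(candidate_attack) - rate(baseline_attack)
    refusal_gain = rate(baseline_benign) - rate(candidate_benign)
    return asr_delta <= MAX_ASR_REGRESSION and refusal_gain >= MIN_OVERREFUSAL_GAIN
```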
MOSR moves the dial on a real problem: filters that are too cautious. It is not a final answer. The approach depends on careful hyperparameter choices and focuses on single-turn jailbreak settings. Expect some tuning, monitoring, and a test harness that measures both attack success rates and false refusals. For teams running LLMs in production, that balance between safety and usability is worth the engineering effort; leaving it unaddressed is how you get both angry users and a predictable probe surface for attackers.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
🔍 ShortSpan Analysis of the Paper
Problem
Large language model safety filters can over-refuse benign prompts, harming usability. The paper analyses this over-refusal from a representation perspective and finds that over-refusal samples sit at the boundary between benign and malicious representations, making it easy for attackers to probe and exploit safety boundaries. The goal is to reduce false refusals while preserving safety, improving the practical usability of aligned LLMs.
Approach
The authors propose MOSR, a mitigation framework that intervenes during safety alignment in the representation space. It has two novel components: Overlap-Aware Loss Weighting, which assigns lower erasure weights to malicious samples that resemble over-refusal prompts, measured by their cosine similarity to a representation of over-refusal samples; and Context-Aware Augmentation, which extends the context for rejection decisions by prepending harmful prefixes before the rejection response. They train with representation-based probes, using three data sources and two LLMs, fine-tuning with LoRA on mid-to-late layers, and evaluate safety and over-refusal with diverse benchmarks. Algorithm 1 formalises the training: an erase loss reduces harmful representations, a retain loss keeps safety alignment consistent with the original model, and Context-Aware Augmentation extends the context for refusals. Datasets include WildGuardMix, WildJailbreak and UltraChat for general capabilities, plus an over-refusal set from OR-Bench. Models tested are Llama3 8B Instruct and Mistral 7B Instruct v0.2.
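Algorithm 1 itself is not reproduced here, but one plausible reading of the erase-plus-retain objective, with the overlap-aware weights applied per sample, is sketched below. The function name, the cosine and MSE choices, and the `alpha` coefficient are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mosr_style_loss(model_reps, frozen_reps,
                    retain_reps_model, retain_reps_frozen,
                    erase_weights, alpha: float = 1.0):
    """Interpretive sketch of an erase-plus-retain objective.

    model_reps / frozen_reps: hidden states for malicious prompts from the
    fine-tuned and original (frozen) model respectively.
    retain_reps_*: the same pair for benign/retain prompts.
    erase_weights: per-sample weights from the overlap-aware scheme, so
    boundary-adjacent malicious samples receive less erasure pressure.
    alpha: illustrative trade-off coefficient.
    """
    # Erase: push the fine-tuned model's harmful representations away from
    # the frozen model's, weighted per sample.
    sim_harm = F.cosine_similarity(model_reps, frozen_reps, dim=-1)
    erase_loss = (erase_weights * sim_harm).mean()

    # Retain: keep benign representations close to the original model so
    # general capability and existing safety alignment are preserved.
    retain_loss = F.mse_loss(retain_reps_model, retain_reps_frozen)

    return erase_loss + alpha * retain_loss
```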
Key Findings
- Over-refusal arises at the representation boundary where benign and malicious prompts overlap; LLMs can distinguish safe from unsafe in ordinary prompts, but over-refusal samples sit misaligned in representation space.
- MOSR with Overlap-Aware Loss Weighting and Context-Aware Augmentation reduces over-refusal while largely preserving safety across models and benchmarks.
- Ablation shows both components contribute; combining both yields the best balance with a competitive safety profile and acceptance of benign prompts.
- Across models such as Llama3 8B Instruct and Mistral 7B Instruct, the method improves the balance between safety and usability; safety is evaluated with standard attack-based ASR scores, over-refusal with dedicated benchmarks, and general capabilities remain comparable to the base models.
Limitations
Balancing safety and over-refusal remains challenging because pseudo-harmful prompts lie near the boundary with benign prompts, causing overlap that is difficult to erase without harming usefulness. Context-Aware Augmentation can slightly reduce safety in some cases, and hyperparameters such as context length and temperature influence results. The study focuses on single-turn jailbreak settings; multi-turn over-refusal scenarios are left for future work. Training costs and reliance on LoRA are further considerations.
Why It Matters
Practically, the approach reduces user frustration and discourages attempts to bypass safety, supporting safer deployment in sensitive sectors. Security implications include recognising the boundary risk in safety decisions and an attack surface in the Context-Aware Augmentation step that could be misused to alter rejection rationales or content. Societal impact centres on enhancing trust and safety usability rather than addressing broader issues such as autonomous weapons or mass surveillance.