Train models to abstain when uncertain
Large language models (LLMs) that answer everything with confidence are a security headache masquerading as convenience. The paper here takes a practical swing at that problem by teaching a model when not to answer. The method, called Reinforced Hesitation (RH), tweaks reinforcement learning from verifiable rewards so the model sees three outcomes: a positive reward for correct answers, zero for abstaining, and a penalty of lambda for wrong answers. That simple change turns “I don’t know” from a failure into a controllable signal.
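To make the scheme concrete, here is a minimal sketch of that ternary reward in Python. It assumes an exact-match verifier and an explicit abstention signal; the function name and interface are illustrative, not the paper's code.

```python
def ternary_reward(answer: str | None, ground_truth: str, lambda_penalty: float) -> float:
    """Reinforced Hesitation-style scoring: +1 correct, 0 abstain, -lambda wrong.

    A minimal sketch of the reward rule described in the paper; the verifier,
    answer parsing and RL plumbing around it are assumptions.
    """
    if answer is None:          # the model emitted an explicit abstention
        return 0.0
    if answer == ground_truth:  # verifiable check against the known solution
        return 1.0
    return -lambda_penalty      # confident wrong answers are penalised
```

Raising `lambda_penalty` makes wrong answers more expensive than silence, which is what pushes the trained model toward abstaining on questions it is likely to get wrong.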
Experiments use a 1.7-billion-parameter Qwen3 model on Knights and Knaves logic puzzles (80,000 training and 10,000 test examples). The team trains variants with penalties lambda in {0, 1, 2, 5, 10, 20}. Results show a clear trade-off: low penalties produce aggressive answerers, high penalties produce conservative abstainers. The penalty becomes a tuning knob that places models along a Pareto frontier for different risk regimes, rather than a search for a single universally best model.
That abstention signal is useful at inference time, and two strategies exploit it. Cascading routes a query through a lineup of models with decreasing risk tolerance until one answers. Self-cascading re-queries the same model after an abstention to exploit nondeterminism. On the puzzle suite a lambda=1 model abstains on roughly 60% of hard problems and about 10% of easy ones; conditional accuracy rises and overall error falls from about 15% to under 2% compared with a baseline. A five-model cascade (penalties 10→0) hits 88.1% accuracy with 2.2 average queries, beating majority voting at lower cost. Self-cascading can lift a lambda=1 model from 77.5% to 92.5% given enough compute while driving final abstention to under 1%.
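As a rough illustration of both strategies (not the authors' implementation), assume each model exposes a call that returns either an answer string or `None` when it abstains:

```python
from typing import Callable, Optional

Model = Callable[[str], Optional[str]]  # returns an answer, or None to abstain

def cascade(models: list[Model], query: str) -> Optional[str]:
    """Route through models ordered from most to least cautious (high to low penalty).

    The first model that does not abstain provides the final answer.
    """
    for model in models:
        answer = model(query)
        if answer is not None:
            return answer
    return None  # every model abstained; hand off to a human

def self_cascade(model: Model, query: str, max_tries: int = 8) -> Optional[str]:
    """Re-query the same model after abstention to exploit sampling nondeterminism."""
    for _ in range(max_tries):
        answer = model(query)
        if answer is not None:
            return answer
    return None
```

The ordering in `cascade` mirrors the paper's setup: cautious high-penalty models go first and pass anything they decline to answer down to bolder ones.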
Not everything is rosy. The study focuses on puzzles with clear ground truth and a single 1.7B model. Domain tests diverge: MedQA showed no abstention at all under the penalties, while GPQA produced domain-dependent calibration. Training dynamics also matter: high penalties can trigger transient over-abstention that later recalibrates. In short, abstention can be learned, but it is neither automatic nor universally transferable.
Why it matters
For security teams the payoff is straightforward. Models that know their limits cut the most damaging failure mode: confident hallucination in critical contexts such as healthcare or legal advice. Abstention also becomes an auditable signal you can monitor and route on, reducing wasted queries and giving humans a clearer handoff point.
- Reduce dangerous confident errors in high‑stakes decisions.
- Use risk‑tuned routing to raise accuracy while controlling cost.
- But accept that abstention adds new attack surfaces around prompting and routing.
What to do next
If you are fielding LLMs in sensitive workflows, treat abstention as a first‑class safety control. Tune the penalty (lambda) to match your error cost, run adversarial red teams against abstention patterns, and implement authenticated, tamper‑evident routing so attackers cannot force or bypass deferrals. Monitor abstention rates and alert on anomalies, and keep humans in the loop for final decisions. RH is not a silver bullet, but it gives security teams a practical lever to trade boldness for safety and to build more auditable AI agents.
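As one hedged example of the monitoring step, a deployment could track a rolling abstention rate per model or route and alert when it drifts outside an expected band. The window size, bounds and alert hook below are placeholders to tune against your own traffic and error costs, not recommendations from the paper.

```python
from collections import deque

class AbstentionMonitor:
    """Track a rolling abstention rate and flag anomalies.

    A sketch only: window, bounds and the alert channel are assumptions.
    """
    def __init__(self, window: int = 500, low: float = 0.02, high: float = 0.40):
        self.events = deque(maxlen=window)  # True = abstained, False = answered
        self.low, self.high = low, high

    def record(self, abstained: bool) -> None:
        self.events.append(abstained)
        rate = sum(self.events) / len(self.events)
        if len(self.events) == self.events.maxlen and not (self.low <= rate <= self.high):
            self.alert(rate)

    def alert(self, rate: float) -> None:
        # Replace with your paging or SIEM integration; printing is a placeholder.
        print(f"Abstention rate anomaly: {rate:.1%} outside [{self.low:.0%}, {self.high:.0%}]")
```

A sudden drop toward zero abstention is as suspicious as a spike: it may mean an attacker has found a prompt that forces the model to answer everything.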
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation
🔍 ShortSpan Analysis of the Paper
Problem
Modern language models frequently give confident answers even when they are wrong, a serious risk in high-stakes tasks. This paper studies the failure to abstain and tests methods for teaching models when not to answer, turning abstention into a trustworthy coordination signal rather than a failure.
Approach
The authors introduce Reinforced Hesitation (RH), a modification of Reinforcement Learning from Verifiable Rewards (RLVR) that replaces binary rewards with a ternary scheme: plus one for correct answers, zero for abstention, and a penalty of lambda for wrong answers. They evaluate on Knights and Knaves logic puzzles using 80,000 training and 10,000 test examples and a base Qwen3 1.7B model trained with penalties lambda in {0, 1, 2, 5, 10, 20}. Training spans one epoch (312 steps) with a 4096-token limit on four Nvidia H100 GPUs using the verl RLHF/RLVR framework. They implement two inference strategies: cascading through models with decreasing risk tolerance, and self-cascading, which re-queries the same model after abstention, converting abstention into selective questioning and collaboration while reducing computation. They compare these with majority-voting baselines and report cross-model, Pareto-style evaluations across risk regimes.
Key Findings
- Frontier models abstain extremely rarely even when penalties are explicit, indicating that prompts cannot override training incentives that reward any answer over no answer.
- RH creates a Pareto frontier in which different penalties yield specialised models for different risk regimes, with high penalties favouring abstention and low penalties favouring aggressive answering.
- Abstention becomes a coordination signal enabling inference-time routing: cascading across models with decreasing risk tolerance achieves high accuracy with few queries, while self-cascading re-queries the same model to exploit nondeterminism.
- In Knights and Knaves experiments, a penalty of lambda = 1 yields abstention on about 60 percent of hard problems and about 10 percent of easy problems, with conditional accuracy rising and overall error dropping from roughly 15 percent to below 2 percent compared with a baseline.
- A cascade of five models with penalties from 10 to 0 achieves 88.1 percent accuracy with 2.2 average queries, outperforming voting or single models; self-cascading can raise the lambda = 1 model from 77.5 percent to 92.5 percent given a computational budget, while abstention rates fall to under 1 percent.
- Training dynamics show a transient crisis at high penalties, with abrupt abstention spikes followed by recalibration; higher penalties lead to greater abstention on hard problems while easy problems see reduced abstention; response lengths shorten under higher penalties, aided by a format penalty that encourages concise abstention.
- Cross-evaluation reveals mutual non-domination: no single model dominates across all risk contexts, reinforcing that there is no universal best model and that abstention enables targeted trust by selecting models aligned to task risk.
- MedQA shows universal abstention failure, with zero abstentions under penalties, while GPQA shows domain-dependent calibration with meaningful abstention for some models, indicating differences across domains in how abstention is learned or applied.
Limitations
The study concentrates on Knights and Knaves puzzles with clear ground truth and uses a single 1.7B model; generalisability to subjective domains or larger architectures remains to be shown. Selecting appropriate penalty values (lambda) requires domain expertise, and further work is needed to extend RH to other domains and to assess robustness under adversarial conditions. The authors note that objective, domain-specific evaluations and human oversight are important for critical applications.
Why It Matters
Abstention is proposed as a first-class safety feature enabling risk-aware multi-model coordination and auditable reasoning in high-stakes domains such as healthcare or law. Practical implications include reducing dangerous confident misinformation, lowering inference cost through cascading and self-cascading strategies, and enabling safer deployment with interpretable signals when models defer. Potential attack surfaces include prompt-design attempts to force abstention or to bypass routing; mitigations include authenticated, tamper-evident routing, monitoring of abstention patterns, domain-specific penalty tuning and human oversight. Societal impact includes safer AI-enabled decision support in critical settings and a shift in evaluation toward balancing error costs alongside accuracy. The practical takeaway is to treat abstention as core safety infrastructure and to integrate risk-aware coordination into AI deployment pipelines.