Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
¹University of Maryland, College Park · ²Indian Institute of Technology, Bombay · ³University of Central Florida
SafeThink is an inference-time safety defense: monitor step-by-step safety, trigger only on violations, and steer minimally — often within the first 1–3 reasoning steps.
(Teaser figure, right panel) Reasoning fine-tuning boosts accuracy but hurts safety (higher attack success rate, ASR); SafeThink restores safety at inference time without losing reasoning.
Reasoning boosts capability, but it can also increase jailbreak vulnerability by imposing a reasoning tax on safety: fine-tuned reasoning traces can drift into unsafe steps, raising ASR even when the underlying base model is well-aligned.
Satisficing safety: meet a safety threshold, then stop intervening.
- No retraining — deploy at inference time.
- Minimal changes — short steering prefix, early only.
- Preserve utility — avoid blunt truncation / refusal spam.
Abstract
Reinforcement learning (RL) post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large reasoning models (MLRMs), but recent evidence shows it can also degrade safety alignment and increase jailbreak success rates.
We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix (“Wait, think safely”) only when the safety threshold is violated.
Across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30–60% (e.g., LlamaV-o1: 63.33%→5.74% on JailbreakV-28K, R1-Onevision: 69.07%→5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20%→65.00%).
A key empirical finding is that safety recovery is often only a few steering steps away: intervening in the first 1–3 reasoning steps typically suffices to redirect the full generation toward safe completions.
Method
SafeThink in three steps
SafeThink treats safety recovery as a satisficing constraint: intervene only until the reasoning trace becomes safe, then continue normally to preserve utility.
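In symbols (a sketch in our shorthand; z_{<t} denotes the partial trace before step t, matching the notation of the algorithm below), the satisficing rule accepts the first continuation that clears the threshold rather than searching for the safest one:

    accept y_t  if  R_safe(x, z_{<t} ⊕ y_t) ≥ τ       # satisficing: first safe continuation wins
    (not: y_t = argmax_y R_safe(x, z_{<t} ⊕ y))       # maximizing would keep intervening and cost utility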
1. Monitor: Track safety during generation by scoring the partial reasoning trace with a safety reward model after each step/token (a scorer sketch follows this list).
2. Trigger: If safety falls below a threshold τ, trigger a lightweight correction. Otherwise, keep decoding with no changes.
3. Steer (early, minimal): Inject a short corrective prefix (e.g., “Wait, think safely”) and resample the next step(s) until the safety score recovers (often within 1–3 steps).
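As referenced in step 1, here is a minimal Python sketch of the Monitor step. It assumes an off-the-shelf binary safe/unsafe text classifier as a stand-in for the paper's safety reward model; the checkpoint name and label scheme are placeholders, not the ones used in the paper:

```python
# Sketch of the Monitor step: score a partial reasoning trace for safety.
# ASSUMPTION: a binary safe/unsafe text classifier stands in for the paper's
# safety reward model; "some-org/safety-classifier" is a placeholder name.
from transformers import pipeline

scorer = pipeline("text-classification", model="some-org/safety-classifier")

def r_safe(query: str, partial_trace: str) -> float:
    """Return a safety score in [0, 1] for the reasoning trace so far."""
    out = scorer(f"Query: {query}\nReasoning so far: {partial_trace}")[0]
    # Read the classifier's probability of the "safe" label as the reward.
    return out["score"] if out["label"] == "safe" else 1.0 - out["score"]
```

Because the scorer runs after every accepted step, a small classifier keeps the monitoring overhead low relative to decoding.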
Algorithm: SafeThink (early-step satisficing steering)
Inputs: query x, model π, safety scorer R_safe, threshold τ, steer prefix s,
        max early steps m, max resamples B
trace z ← ∅
for t = 1..T:
    y ~ π(· | x, z)                      # propose next step/token
    if R_safe(x, z ⊕ y) ≥ τ:             # safe → accept
        z ← z ⊕ y
    else if t ≤ m:                       # unsafe early → steer minimally
        for b = 1..B:
            y' ~ π(· | x, z, s)          # inject "Wait, think safely"
            if R_safe(x, z ⊕ y') ≥ τ:    # satisficing: first safe wins
                break
        z ← z ⊕ y'
    else:
        z ← z ⊕ y                        # optional: no late steering
return π(· | x, z)                       # final answer from (recovered) trace
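For concreteness, a runnable Python sketch of the loop above. Everything here is illustrative: `propose` and `r_safe` are user-supplied stand-ins for the policy π and the safety reward model R_safe, and the defaults (τ = 0.5, m = 3, B = 4) are assumptions, not the paper's settings.

```python
# Minimal runnable sketch of the SafeThink loop (illustrative, not the
# paper's implementation).
import random
from typing import Callable

def safethink(
    query: str,
    propose: Callable[[str, str, str], str],  # π: (query, trace, prefix) -> next step
    r_safe: Callable[[str, str], float],      # R_safe: (query, trace) -> safety score
    tau: float = 0.5,                         # safety threshold τ (assumed default)
    steer: str = "Wait, think safely. ",      # corrective prefix s
    max_steps: int = 16,                      # T: number of reasoning steps
    early_window: int = 3,                    # m: steer only in the first m steps
    budget: int = 4,                          # B: max resamples per steered step
) -> str:
    trace = ""                                # z ← ∅
    for t in range(1, max_steps + 1):
        step = propose(query, trace, "")                # propose next step
        if r_safe(query, trace + step) >= tau:          # safe -> accept
            trace += step
        elif t <= early_window:                         # unsafe early -> steer
            for _ in range(budget):
                step = propose(query, trace, steer)     # inject corrective prefix
                if r_safe(query, trace + step) >= tau:  # satisficing: first safe wins
                    break
            trace += step    # accept (last sample if the budget runs out)
        else:
            trace += step    # optional: no late steering
    return propose(query, trace, "Final answer: ")      # answer from (recovered) trace

# Toy stubs so the sketch runs end to end (replace with a real model/scorer).
if __name__ == "__main__":
    propose = lambda q, z, s: s + "a reasoning step. "
    r_safe = lambda q, z: random.random()
    print(safethink("example query", propose, r_safe))
```

The early `break` is the satisficing move: the first continuation that clears τ is accepted, so the intervention stays minimal instead of spending the whole budget hunting for the highest-scoring step.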
Results
Per-benchmark attack success rates (JailbreakV-28K, Hades, FigStep, MM-SafetyBench), followed by utility on MathVista (accuracy preserved: 65.20% → 65.00%).
BibTeX
@misc{safethink2026,
  title  = {Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away},
  author = {Ghosal, Soumya Suvra and Chakraborty, Souradip and Singh, Vaibhav and Huang, Furong and Manocha, Dinesh and Bedi, Amrit Singh},
  year   = {2026},
  note   = {arXiv preprint}
}