Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
¹University of Maryland, College Park · ²Indian Institute of Technology, Bombay · ³University of Central Florida
SafeThink is an inference-time safety defense: monitor step-by-step safety, trigger only on violations, and steer minimally — often within the first 1–3 reasoning steps.
(Teaser figure, right panel) Reasoning fine-tuning boosts accuracy but hurts safety (higher attack success rate, ASR); SafeThink restores safety at inference time without losing reasoning.
Reasoning boosts capability, but it can also increase jailbreak vulnerability by imposing a reasoning tax on safety: fine-tuned reasoning traces can drift into unsafe steps, raising ASR even when the underlying base model is well-aligned.
Satisficing safety: meet a safety threshold, then stop intervening.
- No retraining — deploy at inference time.
- Minimal changes — short steering prefix, early only.
- Preserve utility — avoid blunt truncation / refusal spam.
Abstract
Reinforcement learning (RL) post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large reasoning models (MLRMs), but recent evidence shows it can also degrade safety alignment and increase jailbreak success rates.
We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix (“Wait, think safely”) only when the safety threshold is violated.
Across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30–60% (e.g., LlamaV-o1: 63.33%→5.74% on JailbreakV-28K, R1-Onevision: 69.07%→5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20%→65.00%).
A key empirical finding is that safety recovery is often only a few steering steps away: intervening in the first 1–3 reasoning steps typically suffices to redirect the full generation toward safe completions.
Method
SafeThink in three steps
SafeThink treats safety recovery as a satisficing constraint: intervene only until the reasoning trace becomes safe, then continue normally to preserve utility.
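In symbols (a sketch in our shorthand; z_{<t} denotes the partial trace before step t, matching the notation of the algorithm below), the satisficing rule accepts the first continuation that clears the threshold rather than searching for the safest one:

    accept y_t  if  R_safe(x, z_{<t} ⊕ y_t) ≥ τ       # satisficing: first safe continuation wins
    (not: y_t = argmax_y R_safe(x, z_{<t} ⊕ y))       # maximizing would keep intervening and cost utility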
1. Monitor: Track safety during generation by scoring the partial reasoning trace with a safety reward model after each step/token (a scorer sketch follows this list).
2. Trigger: If safety falls below a threshold τ, trigger a lightweight correction. Otherwise, keep decoding with no changes.
3. Steer (early, minimal): Inject a short corrective prefix (e.g., “Wait, think safely”) and resample the next step(s) until the safety score recovers (often within 1–3 steps).
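As referenced in step 1, here is a minimal Python sketch of the Monitor step. It assumes an off-the-shelf binary safe/unsafe text classifier as a stand-in for the paper's safety reward model; the checkpoint name and label scheme are placeholders, not the ones used in the paper:

```python
# Sketch of the Monitor step: score a partial reasoning trace for safety.
# ASSUMPTION: a binary safe/unsafe text classifier stands in for the paper's
# safety reward model; "some-org/safety-classifier" is a placeholder name.
from transformers import pipeline

scorer = pipeline("text-classification", model="some-org/safety-classifier")

def r_safe(query: str, partial_trace: str) -> float:
    """Return a safety score in [0, 1] for the reasoning trace so far."""
    out = scorer(f"Query: {query}\nReasoning so far: {partial_trace}")[0]
    # Read the classifier's probability of the "safe" label as the reward.
    return out["score"] if out["label"] == "safe" else 1.0 - out["score"]
```

Because the scorer runs after every accepted step, a small classifier keeps the monitoring overhead low relative to decoding.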
Algorithm: SafeThink (early-step satisficing steering)
Inputs: query x, model π, safety scorer R_safe, threshold τ, steer prefix s,
        max early steps m, max resamples B
trace z ← ∅
for t = 1..T:
    y ~ π(· | x, z)                      # propose next step/token
    if R_safe(x, z ⊕ y) ≥ τ:             # safe → accept
        z ← z ⊕ y
    else if t ≤ m:                       # unsafe early → steer minimally
        for b = 1..B:
            y' ~ π(· | x, z, s)          # inject "Wait, think safely"
            if R_safe(x, z ⊕ y') ≥ τ:    # satisficing: first safe wins
                break
        z ← z ⊕ y'
    else:
        z ← z ⊕ y                        # optional: no late steering
return π(· | x, z)                       # final answer from (recovered) trace
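For concreteness, a runnable Python sketch of the loop above. Everything here is illustrative: `propose` and `r_safe` are user-supplied stand-ins for the policy π and the safety reward model R_safe, and the defaults (τ = 0.5, m = 3, B = 4) are assumptions, not the paper's settings.

```python
# Minimal runnable sketch of the SafeThink loop (illustrative, not the
# paper's implementation).
import random
from typing import Callable

def safethink(
    query: str,
    propose: Callable[[str, str, str], str],  # π: (query, trace, prefix) -> next step
    r_safe: Callable[[str, str], float],      # R_safe: (query, trace) -> safety score
    tau: float = 0.5,                         # safety threshold τ (assumed default)
    steer: str = "Wait, think safely. ",      # corrective prefix s
    max_steps: int = 16,                      # T: number of reasoning steps
    early_window: int = 3,                    # m: steer only in the first m steps
    budget: int = 4,                          # B: max resamples per steered step
) -> str:
    trace = ""                                # z ← ∅
    for t in range(1, max_steps + 1):
        step = propose(query, trace, "")                # propose next step
        if r_safe(query, trace + step) >= tau:          # safe -> accept
            trace += step
        elif t <= early_window:                         # unsafe early -> steer
            for _ in range(budget):
                step = propose(query, trace, steer)     # inject corrective prefix
                if r_safe(query, trace + step) >= tau:  # satisficing: first safe wins
                    break
            trace += step    # accept (last sample if the budget runs out)
        else:
            trace += step    # optional: no late steering
    return propose(query, trace, "Final answer: ")      # answer from (recovered) trace

# Toy stubs so the sketch runs end to end (replace with a real model/scorer).
if __name__ == "__main__":
    propose = lambda q, z, s: s + "a reasoning step. "
    r_safe = lambda q, z: random.random()
    print(safethink("example query", propose, r_safe))
```

The early `break` is the satisficing move: the first continuation that clears τ is accepted, so the intervention stays minimal instead of spending the whole budget hunting for the highest-scoring step.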
Results
Per-benchmark attack success rates (JailbreakV-28K, Hades, FigStep, MM-SafetyBench), followed by utility on MathVista (accuracy preserved: 65.20% → 65.00%).
BibTeX
@misc{safethink2026,
  title  = {Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away},
  author = {Ghosal, Soumya Suvra and Chakraborty, Souradip and Singh, Vaibhav and Huang, Furong and Manocha, Dinesh and Bedi, Amrit Singh},
  year   = {2026},
  note   = {arXiv preprint}
}