Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

CVPR 2025

1University of Maryland    2Indian Institute of Technology Bombay   3Princeton University  4University of Colorado Boulder   5University of Central Florida 
denotes equal contribution
denotes equal advising

Abstract

With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks: carefully crafted image-prompt pairs that compel the model to generate harmful content. In this work, we first highlight a critical safety gap, demonstrating that alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model during decoding to defend against jailbreak attacks. Additionally, we provide a rigorous mathematical characterization of Immune, offering provable guarantees against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model’s original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and the state-of-the-art defense strategy, respectively.

Algorithm

Our framework, Immune, defends against jailbreak attacks by re-framing safety alignment as an inference-time controlled decoding problem. Given an adversarial prompt xadv and a base MLLM πsafe, Immune leverages a safety-aware reward model Rsafe to steer the generation toward safe outputs.

At each decoding step t, the current state is defined as st = [xadv, y<t], where y<t denotes the sequence generated so far. The algorithm proceeds as follows (a minimal code sketch of the decoding loop is given after the list):

  1. For t = 0, 1, …, T do:
    1. Set the current state: st = [xadv, y<t], where y<t denotes the tokens generated so far.
    2. Select the top-k tokens under the base model to form the candidate set:
      V̂ = Top-k( πsafe(·|st) )
    3. For each token z ∈ V̂:
      1. Evaluate the safety value: Qsafe(st, z).
      2. Compute the decoding score: g(z) = (1/α) · Qsafe(st, z) + log πsafe(z|st).
    4. Construct the probability distribution over V̂:
      f(z|st) = exp(g(z)) / ∑z′∈V̂ exp(g(z′))
    5. Sample the next token: yt ∼ f(·|st).
    6. Update the state: st+1 = [st, yt].
  2. Return the output sequence: y = [y0, y1, …, yT].
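
Below is a minimal Python sketch of this decoding loop, intended only as an illustration rather than the released implementation. The callables base_logprobs (returning the vector of log πsafe(·|st) over the vocabulary) and q_safe (returning the scalar Qsafe(st, z)), as well as the token-id representation of the state, are hypothetical stand-ins for the base MLLM and the safety-aware reward model.

```python
# Sketch of Immune-style inference-time controlled decoding (see the steps above).
# Assumptions: `base_logprobs(state)` -> 1-D tensor of log pi_safe(.|s_t) over the
# vocabulary; `q_safe(state, token_id)` -> scalar safety value Q_safe(s_t, z);
# `x_adv` is the (multimodal) prompt already encoded as a list of token ids.
import torch

def immune_decode(x_adv, base_logprobs, q_safe, alpha=1.0, k=10,
                  max_new_tokens=128, eos_id=None):
    state = list(x_adv)                              # s_0 = [x_adv]
    generated = []
    for _ in range(max_new_tokens):
        logp = base_logprobs(state)                  # log pi_safe(.|s_t)
        top_logp, top_ids = torch.topk(logp, k)      # candidate set V_hat (top-k tokens)

        # g(z) = (1/alpha) * Q_safe(s_t, z) + log pi_safe(z|s_t) for each candidate z
        q_vals = torch.tensor([q_safe(state, z.item()) for z in top_ids])
        scores = q_vals / alpha + top_logp

        # f(.|s_t): softmax of g over V_hat, then sample the next token y_t
        probs = torch.softmax(scores, dim=-1)
        y_t = top_ids[torch.multinomial(probs, 1)].item()

        generated.append(y_t)
        state.append(y_t)                            # s_{t+1} = [s_t, y_t]
        if eos_id is not None and y_t == eos_id:     # stop early at end-of-sequence
            break
    return generated
```

Since g(z) = (1/α) · Qsafe(st, z) + log πsafe(z|st), smaller values of α weight the safety value more heavily relative to the base model's own likelihood, while α → ∞ recovers ordinary top-k sampling from the base MLLM.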

Experiment Results

The table below summarizes our evaluation of various defense strategies on multi-modal large language models using a dataset of adversarial image–text pairs. In our experiments, the image inputs are manipulated under four different conditions: random noise, images generated via Stable Diffusion (SD), natural images, and blank images. Simultaneously, the textual prompts are crafted using three distinct approaches (template-based, persuasive, and logic-driven) to simulate a range of adversarial scenarios.

We use the Attack Success Rate (ASR) as our primary metric, where a lower ASR indicates a stronger defense against jailbreak attacks. All evaluations are performed using the Llama-Guard-3 Jailbreak Judge as the oracle classifier. The results demonstrate that the original decoding of these models yields high ASR values, revealing significant vulnerabilities. In contrast, our proposed method, Immune, achieves markedly lower ASR values across state-of-the-art models such as LLaVA-1.6, LLaVA-1.5, MiniGPT-4-7B, MiniGPT-4-13B, and Qwen-VL, thereby validating the efficacy of our inference-time alignment approach in enhancing model safety while preserving utility.
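
For reference, ASR is simply the percentage of adversarial prompts for which the judge classifies the model's response as harmful. A minimal sketch of the computation (the judge verdicts below are hypothetical, not drawn from our evaluation):

```python
# Attack Success Rate (ASR): fraction of attacks judged harmful, as a percentage.
# `verdicts` is a hypothetical list of booleans from the jailbreak judge
# (True = the response was classified as harmful, i.e., the attack succeeded).

def attack_success_rate(verdicts):
    return 100.0 * sum(verdicts) / len(verdicts) if verdicts else 0.0

# Example: 3 harmful responses out of 8 adversarial prompts -> ASR = 37.5
print(attack_success_rate([True, False, True, False, False, True, False, False]))
```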

Model | Defense Strategy | Noise (Template / Persuade / Logic) | SD (Template / Persuade / Logic) | Nature (Template / Persuade / Logic) | Blank (Template / Persuade / Logic) | Average
LLaVA-1.6 | Original | 66.12 / 37.45 / 78.58 | 67.34 / 37.56 / 77.57 | 69.23 / 40.78 / 82.61 | 66.67 / 39.45 / 81.60 | 60.27
LLaVA-1.6 | FigStep | 61.12 / 39.27 / 62.16 | 62.34 / 40.18 / 54.05 | 63.41 / 35.22 / 56.76 | 61.09 / 38.43 / 52.70 | 51.64
LLaVA-1.6 | AdaShield | 38.42 / 0.00 / 11.08 | 38.13 / 1.56 / 18.13 | 39.29 / 3.22 / 19.14 | 42.78 / 0.48 / 16.12 | 19.23
LLaVA-1.6 | CoCA | 61.23 / 39.17 / 62.16 | 61.34 / 41.28 / 52.70 | 63.11 / 35.22 / 55.41 | 61.09 / 37.36 / 52.70 | 51.37
LLaVA-1.6 | Immune (Ours) | 5.23 / 0.00 / 0.00 | 8.14 / 0.00 / 0.00 | 8.45 / 0.00 / 0.00 | 4.67 / 0.00 / 0.00 | 2.45
LLaVA-1.5 | Original | 58.12 / 39.44 / 76.56 | 61.47 / 38.35 / 74.55 | 58.22 / 42.11 / 77.57 | 59.33 / 40.26 / 74.55 | 56.39
LLaVA-1.5 | FigStep | 64.17 / 28.34 / 62.16 | 62.23 / 32.18 / 68.91 | 58.39 / 37.27 / 72.97 | 61.09 / 31.42 / 68.91 | 52.46
LLaVA-1.5 | AdaShield | 32.14 / 0.00 / 5.41 | 31.36 / 0.00 / 4.23 | 30.22 / 0.00 / 5.41 | 29.67 / 0.00 / 10.81 | 12.86
LLaVA-1.5 | CoCA | 59.18 / 26.34 / 35.13 | 58.42 / 30.29 / 35.13 | 49.22 / 32.11 / 28.37 | 59.07 / 23.39 / 37.83 | 39.87
LLaVA-1.5 | Immune (Ours) | 9.23 / 0.00 / 0.00 | 8.14 / 0.00 / 0.00 | 1.47 / 0.00 / 0.00 | 5.32 / 0.00 / 0.00 | 2.10
MiniGPT-4-7B | Original | 75.23 / 54.12 / 83.62 | 72.44 / 37.33 / 85.63 | 73.18 / 45.27 / 86.64 | 72.09 / 55.42 / 86.64 | 67.15
MiniGPT-4-7B | FigStep | 47.22 / 17.18 / 32.43 | 42.13 / 5.37 / 24.32 | 37.29 / 8.45 / 18.92 | 43.08 / 17.41 / 43.24 | 27.74
MiniGPT-4-7B | AdaShield | 47.23 / 22.14 / 32.24 | 41.37 / 18.46 / 27.20 | 58.15 / 12.28 / 30.22 | 46.09 / 19.43 / 32.24 | 32.21
MiniGPT-4-7B | CoCA | 35.18 / 18.27 / 22.97 | 40.34 / 21.14 / 31.08 | 35.29 / 18.42 / 27.03 | 48.11 / 21.36 / 40.54 | 29.74
MiniGPT-4-7B | Immune (Ours) | 18.23 / 6.12 / 44.59 | 11.34 / 8.27 / 29.72 | 17.18 / 8.45 / 27.02 | 16.09 / 10.37 / 43.24 | 18.34
MiniGPT-4-13B | Original | 74.18 / 47.12 / 90.67 | 80.23 / 53.34 / 85.63 | 79.29 / 52.45 / 83.78 | 79.07 / 53.22 / 87.65 | 70.62
MiniGPT-4-13B | FigStep | 44.13 / 20.28 / 58.11 | 43.22 / 18.47 / 47.30 | 40.36 / 27.14 / 48.65 | 42.09 / 19.42 / 58.11 | 37.41
MiniGPT-4-13B | AdaShield | 63.22 / 23.18 / 59.44 | 64.47 / 31.12 / 74.55 | 69.15 / 27.35 / 68.91 | 65.09 / 23.41 / 53.39 | 50.55
MiniGPT-4-13B | CoCA | 35.29 / 21.34 / 36.49 | 40.18 / 22.27 / 40.54 | 35.12 / 20.41 / 35.14 | 48.22 / 21.36 / 52.70 | 33.21
MiniGPT-4-13B | Immune (Ours) | 25.17 / 18.29 / 59.44 | 29.34 / 19.41 / 57.42 | 29.22 / 23.18 / 58.43 | 23.47 / 23.37 / 58.43 | 32.94
Qwen-VL | Original | 46.12 / 3.27 / 12.09 | 52.23 / 6.18 / 9.07 | 53.34 / 3.42 / 7.05 | 53.11 / 4.36 / 15.11 | 22.99
Qwen-VL | FigStep | 32.18 / 0.00 / 1.35 | 40.42 / 0.00 / 5.41 | 30.23 / 0.00 / 5.41 | 44.12 / 2.23 / 1.35 | 14.42
Qwen-VL | AdaShield | 30.17 / 0.00 / 18.13 | 28.34 / 1.29 / 15.11 | 23.28 / 1.37 / 7.05 | 42.11 / 0.41 / 18.13 | 15.88
Qwen-VL | CoCA | 29.12 / 6.27 / 13.51 | 27.34 / 7.18 / 5.41 | 27.29 / 1.42 / 13.51 | 28.47 / 13.36 / 13.51 | 15.69
Qwen-VL | Immune (Ours) | 10.27 / 2.18 / 5.41 | 21.34 / 2.29 / 4.03 | 18.22 / 2.35 / 5.41 | 20.17 / 3.37 / 7.05 | 8.58

BibTeX

@misc{ghosal2024immuneimprovingsafetyjailbreaks,
      title={Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment}, 
      author={Soumya Suvra Ghosal and Souradip Chakraborty and Vaibhav Singh and Tianrui Guan and Mengdi Wang and Ahmad Beirami and Furong Huang and Alvaro Velasquez and Dinesh Manocha and Amrit Singh Bedi},
      year={2024},
      url={https://arxiv.org/abs/2411.18688}, 
}