IMMUNE

CVPR 2025

¹University of Maryland ²Indian Institute of Technology Bombay ³Princeton University ⁴University of Colorado Boulder ⁵University of Central Florida

∗ denotes equal contribution
† denotes equal advising

Abstract

With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks—carefully crafted image-prompt pairs that compel the model to generate harmful content. In this work, we first highlight a critical safety gap, demonstrating that alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model during decoding to defend against jailbreak attacks. Additionally, we provide a rigorous mathematical characterization of Immune, offering provable guarantees against jailbreaks. Extensive evaluations on diverse jailbreak bench- marks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model’s original capabilities. For instance, against text-based jailbreak at- tacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and state-of-the-art defense strategy, respectively.

Algorithm

Our framework, Immune, defends against jailbreak attacks by re-framing safety alignment as an inference-time controlled decoding problem. Given an adversarial prompt x_adv and a base MLLM π_safe, Immune leverages a safety-aware reward model R_safe to steer the generation toward safe outputs.

At each decoding step t, the current state is defined as s_t = [x_adv, y_<t], where y<t denotes the sequence generated so far. The algorithm then:

            
                For t = 0, 1, …, T do:
                
                    Set the current state: st = [xadv, y<t], where y<t denotes the tokens generated so far.
                  
                    Sample the top-k tokens from the base model:
                    
                    V̂ = { z : z ∼ πsafe(·|st) }
                  
                    For each token z ∈ V̂:
                    
                        Evaluate the safety value: Qsafe(st, z).
                      
                        Compute the decoding score: g(z) = (1/α) · Qsafe(st, z) + log πsafe(z|st).
                      
                    Construct the probability distribution over V̂:
                    
                    f(z|st) = exp(g(z)) / ∑z′∈V̂ exp(g(z′))
                  
                    Sample the next token: yt ∼ f(·|st).
                  
                    Update the state: st+1 = [st, yt].
                  
                Return the output sequence: y = [y0, y1, …, yT].

Experiment Results

The table below summarizes our evaluation of various defense strategies on multi-modal large language models using a dataset of adversarial image–text pairs. In our experiments, the image inputs are manipulated under four different conditions: random noise, images generated via Stable Diffusion (SD), natural images, and blank images. Simultaneously, the textual prompts are crafted using three distinct approaches—template‐based, persuasive, and logic‐driven—to simulate a range of adversarial scenarios.

We use the Attack Success Rate (ASR) as our primary metric, where a lower ASR indicates a stronger defense against jailbreak attacks. All evaluations are performed using the Llama-Guard-3 Jailbreak Judge as the oracle classifier. The results demonstrate that the original decoding of these models yields high ASR values, revealing significant vulnerabilities. In contrast, our proposed method, Immune, achieves markedly lower ASR values across state-of-the-art models such as LLaVA-1.6, LLaVA-1.5, MiniGPT-4-7B, MiniGPT-4-13B, and Qwen-VL, thereby validating the efficacy of our inference-time alignment approach in enhancing model safety while preserving utility.

Model	Defense Strategy	Noise			SD			Nature			Blank			Average
Model	Defense Strategy	Template	Persuade	Logic	Template	Persuade	Logic	Template	Persuade	Logic	Template	Persuade	Logic	Average
LLaVA-1.6	Original	66.12	37.45	78.58	67.34	37.56	77.57	69.23	40.78	82.61	66.67	39.45	81.60	60.27
	FigStep	61.12	39.27	62.16	62.34	40.18	54.05	63.41	35.22	56.76	61.09	38.43	52.70	51.64
	AdaShield	38.42	0.00	11.08	38.13	1.56	18.13	39.29	3.22	19.14	42.78	0.48	16.12	19.23
	CoCA	61.23	39.17	62.16	61.34	41.28	52.70	63.11	35.22	55.41	61.09	37.36	52.70	51.37
	Ours	5.23	0.00	0.00	8.14	0.00	0.00	8.45	0.00	0.00	4.67	0.00	0.00	2.45
LLaVA-1.5	Original	58.12	39.44	76.56	61.47	38.35	74.55	58.22	42.11	77.57	59.33	40.26	74.55	56.39
	FigStep	64.17	28.34	62.16	62.23	32.18	68.91	58.39	37.27	72.97	61.09	31.42	68.91	52.46
	AdaShield	32.14	0.00	5.41	31.36	0.00	4.23	30.22	0.00	5.41	29.67	0.00	10.81	12.86
	CoCA	59.18	26.34	35.13	58.42	30.29	35.13	49.22	32.11	28.37	59.07	23.39	37.83	39.87
	Ours	9.23	0.00	0.00	8.14	0.00	0.00	1.47	0.00	0.00	5.32	0.00	0.00	2.10
MiniGPT-4-7B	Original	75.23	54.12	83.62	72.44	37.33	85.63	73.18	45.27	86.64	72.09	55.42	86.64	67.15
	FigStep	47.22	17.18	32.43	42.13	5.37	24.32	37.29	8.45	18.92	43.08	17.41	43.24	27.74
	AdaShield	47.23	22.14	32.24	41.37	18.46	27.20	58.15	12.28	30.22	46.09	19.43	32.24	32.21
	CoCA	35.18	18.27	22.97	40.34	21.14	31.08	35.29	18.42	27.03	48.11	21.36	40.54	29.74
	Ours	18.23	6.12	44.59	11.34	8.27	29.72	17.18	8.45	27.02	16.09	10.37	43.24	18.34
MiniGPT-4-13B	Original	74.18	47.12	90.67	80.23	53.34	85.63	79.29	52.45	83.78	79.07	53.22	87.65	70.62
	FigStep	44.13	20.28	58.11	43.22	18.47	47.30	40.36	27.14	48.65	42.09	19.42	58.11	37.41
	AdaShield	63.22	23.18	59.44	64.47	31.12	74.55	69.15	27.35	68.91	65.09	23.41	53.39	50.55
	CoCA	35.29	21.34	36.49	40.18	22.27	40.54	35.12	20.41	35.14	48.22	21.36	52.70	33.21
	Ours	25.17	18.29	59.44	29.34	19.41	57.42	29.22	23.18	58.43	23.47	23.37	58.43	32.94
Qwen-VL	Original	46.12	3.27	12.09	52.23	6.18	9.07	53.34	3.42	7.05	53.11	4.36	15.11	22.99
	FigStep	32.18	0.00	1.35	40.42	0.00	5.41	30.23	0.00	5.41	44.12	2.23	1.35	14.42
	AdaShield	30.17	0.00	18.13	28.34	1.29	15.11	23.28	1.37	7.05	42.11	0.41	18.13	15.88
	CoCA	29.12	6.27	13.51	27.34	7.18	5.41	27.29	1.42	13.51	28.47	13.36	13.51	15.69
	Ours	10.27	2.18	5.41	21.34	2.29	4.03	18.22	2.35	5.41	20.17	3.37	7.05	8.58

BibTeX

@InProceedings{Ghosal_2025_CVPR, author = {Ghosal, Soumya Suvra and Chakraborty, Souradip and Singh, Vaibhav and Guan, Tianrui and Wang, Mengdi and Beirami, Ahmad and Huang, Furong and Velasquez, Alvaro and Manocha, Dinesh and Bedi, Amrit Singh}, title = {Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment}, booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)}, month = {June}, year = {2025}, pages = {25038-25049} }

IMMUNE : Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

CVPR 2025

IMMUNE: SOTA defense against jailbreak attacks for Multimodal Language Models.

Abstract

Algorithm

Experiment Results

BibTeX