arXiv: 2411.16721

Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

Published:  at  11:07 AM

ASTRA introduces an efficient defense for Vision Language Models by adaptively steering activations away from adversarial directions using image attribution, achieving state-of-the-art performance in mitigating jailbreak attacks with minimal impact on benign utility and high inference efficiency.

Vision Foundation Model, Safety, Robustness, Multimodal Systems, Adaptive Systems

Han Wang, Gang Wang, Huan Zhang

University of Illinois Urbana-Champaign

Generated by grok-3

Background Problem

Vision Language Models (VLMs) integrate visual and textual inputs to achieve strong multimodal reasoning capabilities, yet they are vulnerable to jailbreak attacks that exploit the visual input to elicit harmful content. These attacks fall into two categories, perturbation-based (adversarial image modifications) and structure-based (malicious content embedded via typography), and they expose attack surfaces that do not exist in text-only language models. Existing defenses such as input preprocessing, adversarial training, and response evaluation are computationally expensive and impractical for real-world deployment, often requiring multiple inference passes or extensive retraining. The key problem addressed in this work is the need for an efficient, effective defense that resists VLM jailbreak attacks without significant computational overhead or performance degradation on benign inputs.

Method

The proposed method, ASTRA, resists jailbreak attacks by adaptively steering VLM activations away from adversarial feature directions identified through image attribution. It operates in two main steps:

  1. Constructing Steering Vectors via Image Attribution: Adversarial images are generated (e.g., with Projected Gradient Descent, PGD), and the visual tokens most associated with jailbreaks are identified by randomly ablating tokens and fitting a linear surrogate model (Lasso). The surrogate estimates how including or excluding each token changes the jailbreak probability, and the top-k most impactful tokens are used to build steering vectors representing harmful response directions (a sketch of this step follows the list).
  2. Adaptive Activation Steering at Inference Time: During inference, activations are calibrated by subtracting an average activation computed over diverse inputs and projected onto the steering vectors; a steering coefficient is applied only when a harmful direction is detected (via a max function), so benign inputs are left largely untouched. The calibration activation is what allows harmful activations to be distinguished from benign ones. Because the defense requires no retraining and no extra inference passes, it targets high efficiency and minimal utility degradation on benign inputs while strongly mitigating harmful outputs under adversarial conditions (see the steering sketch after this list).
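A minimal Python sketch of the attribution step, assuming hypothetical helpers get_visual_token_activations (per-visual-token hidden states from the VLM) and jailbreak_score (how harmful a generated response is), plus a hypothetical visual_token_mask argument on generate; the trial count, keep probability, Lasso penalty, and the aggregation of selected tokens into a steering vector are illustrative choices, not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def build_steering_vector(vlm, adv_image, prompt,
                          n_trials=500, keep_prob=0.5, top_k=32, lasso_alpha=1e-3):
    # Per-visual-token hidden states at the chosen layer, shape (n_tokens, d).
    acts = get_visual_token_activations(vlm, adv_image, prompt)      # hypothetical helper

    # Randomly ablate visual tokens and record how harmful the output becomes.
    n_tokens = acts.shape[0]
    masks = np.random.rand(n_trials, n_tokens) < keep_prob           # True = token kept
    scores = np.array([
        jailbreak_score(vlm.generate(adv_image, prompt, visual_token_mask=m))  # hypothetical
        for m in masks
    ])

    # Linear surrogate: coefficient j estimates how much keeping token j
    # raises the jailbreak probability.
    surrogate = Lasso(alpha=lasso_alpha).fit(masks.astype(float), scores)
    top_tokens = np.argsort(surrogate.coef_)[-top_k:]

    # Aggregate the most jailbreak-associated token activations into a unit steering vector.
    v = acts[top_tokens].mean(axis=0)
    return v / np.linalg.norm(v)
```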

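A minimal sketch of the adaptive steering rule from step 2, written as a function that could be applied, for example, inside a forward hook on the targeted transformer layer; the sign convention, gating, and default coefficient are assumptions consistent with the description above rather than the paper's exact code.

```python
import torch

def adaptive_steer(activation: torch.Tensor,
                   v: torch.Tensor,
                   a_calib: torch.Tensor,
                   coef: float = 6.0) -> torch.Tensor:
    """Steer the activation away from the harmful direction v, but only when the
    calibrated activation actually points toward it (benign inputs pass through)."""
    proj = (activation - a_calib) @ v      # scalar projection onto the harmful direction
    gate = torch.clamp(proj, min=0.0)      # the max(0, .) gate: no steering for benign inputs
    return activation - coef * gate.unsqueeze(-1) * v
```

Registering such a function as a forward hook on the chosen layer keeps the defense within a single inference pass, which is where the efficiency advantage over multi-pass defenses comes from.
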
Experiment

The experiments evaluate ASTRA on three VLMs (MiniGPT-4, Qwen2-VL, LLaVA-v1.5), using ImageNet images to generate adversarial examples (via PGD with varying perturbation radii) and benchmarks such as RealToxicityPrompts, AdvBench, and MM-SafetyBench for toxicity and jailbreak assessment. ASTRA is compared against VLM defenses (e.g., JailGuard, Self-reminder) and LLM steering methods on both perturbation-based and structure-based attacks, as well as on transferability to unseen attacks (e.g., text-only attacks and PGD variants). The metrics are Toxicity Score (via the Detoxify classifier) and Attack Success Rate (ASR, via the HarmBench classifier).

Results show ASTRA significantly outperforms the baselines, achieving up to 12.12% lower Toxicity Score and 17.84% lower ASR than JailGuard on MiniGPT-4, with 9x faster inference due to single-pass processing. Transferability is strong in-distribution and in some out-of-distribution scenarios, though performance varies with attack intensity and type. Utility on benign inputs (evaluated via MM-Vet and MMBench) degrades only minimally, validating the adaptive steering design. However, the experimental design focuses heavily on PGD-based attacks, which may limit generalizability, and the adaptive-attack results indicate reduced but non-zero vulnerability when attackers know the defense mechanism. The setup is comprehensive for the targeted attacks but would benefit from broader attack diversity and real-world deployment testing.
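For concreteness, a hedged sketch of how the two metrics could be computed over a list of generated responses; Detoxify is the open-source toxicity classifier named above, while judge_is_harmful is a hypothetical stand-in for a HarmBench-style harmfulness judge, and the paper's exact aggregation may differ.

```python
from detoxify import Detoxify

def toxicity_score(responses):
    # Detoxify returns per-attribute scores for a list of texts; average the 'toxicity' attribute.
    preds = Detoxify("original").predict(responses)
    return sum(preds["toxicity"]) / len(responses)

def attack_success_rate(prompts, responses):
    # ASR = fraction of responses that a harmfulness judge flags as successful jailbreaks.
    flags = [judge_is_harmful(p, r) for p, r in zip(prompts, responses)]  # hypothetical judge
    return sum(flags) / len(flags)
```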

Further Thoughts

ASTRA’s approach to steering activations in VLMs opens up intriguing possibilities for broader applications in multimodal AI safety. The concept of isolating harmful feature directions via image attribution could extend to other domains such as audio-visual models or even robotics, where multimodal inputs create similar vulnerabilities. However, a deeper investigation into the non-linear dynamics of VLM feature spaces might reveal limitations of the current linear surrogate model; exploring neural-network-based attribution methods could improve accuracy. Additionally, one might connect this work to recent advances in reinforcement learning from human feedback (RLHF) for alignment by integrating steering vectors with RLHF to adapt defenses dynamically based on real-time user interactions. The transferability results, while promising, suggest a need for meta-learning approaches to generalize further across unseen attack types, potentially drawing inspiration from meta-learning frameworks in adversarial robustness research. Finally, ASTRA’s efficiency is notable, but real-world deployment challenges, such as keeping steering vectors relevant as attack strategies evolve, warrant longitudinal studies or online-learning adaptations.


