Tag: Vision Foundation Model
All the articles with the tag "Vision Foundation Model".
-
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
本文提出了一种基于视觉-语言模型的定义引导提示技术和UnHateMeme框架,用于检测和缓解多模态模因中的仇恨内容,通过零样本和少样本提示实现高效检测,并生成非仇恨替代内容以保持图像-文本一致性,在实验中展现出显著效果。
-
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
ASTRA introduces an efficient defense for Vision Language Models by adaptively steering activations away from adversarial directions using image attribution, achieving state-of-the-art performance in mitigating jailbreak attacks with minimal impact on benign utility and high inference efficiency.
-
VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making
This paper introduces VLM Q-Learning, an offline-to-online reinforcement learning method that fine-tunes Vision-Language Models for interactive decision-making by filtering suboptimal actions with a critic head, achieving significant performance improvements over supervised fine-tuning across multiple multimodal agent tasks.
-
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
This paper introduces MM-Escape, a benchmark using the customizable 3D environment EscapeCraft to evaluate multimodal reasoning in MLLMs through room escape tasks, revealing that while models like GPT-4o achieve high success in simple scenarios, performance drops significantly with increased difficulty, exposing distinct limitations in reasoning and spatial awareness.