Tag: Vision Foundation Model
All the articles with the tag "Vision Foundation Model".
-
Task-Core Memory Management and Consolidation for Long-term Continual Learning
This paper introduces Long-CL, a human memory-inspired framework for long-term continual learning that combines task-core memory management with selective sample consolidation; it outperforms baselines by 7.4% and 6.5% AP on two new benchmarks, MMLongCL-Bench and TextLongCL-Bench, respectively, while mitigating catastrophic forgetting.
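Conceptually, task-core memory management amounts to keeping only a small budget of the most representative samples per task for later rehearsal. Below is a minimal sketch of that general idea; the `TaskCoreMemory` class, the prototype-distance selection rule, and the buffer layout are illustrative assumptions, not Long-CL's actual algorithm.

```python
# Illustrative sketch: keep a fixed per-task budget of samples closest to
# the task's feature prototype, then replay them across tasks. The
# selection criterion here is an assumption, not Long-CL's method.
import torch

class TaskCoreMemory:
    def __init__(self, per_task_budget: int):
        self.per_task_budget = per_task_budget
        self.memory: dict[int, torch.Tensor] = {}  # task_id -> stored features

    def consolidate(self, task_id: int, features: torch.Tensor) -> None:
        """Keep the samples nearest the task prototype (mean feature)."""
        prototype = features.mean(dim=0, keepdim=True)        # (1, D)
        dists = torch.cdist(features, prototype).squeeze(1)   # (N,)
        keep = dists.argsort()[: self.per_task_budget]
        self.memory[task_id] = features[keep]

    def replay_batch(self, batch_size: int) -> torch.Tensor:
        """Sample uniformly across all stored tasks for rehearsal."""
        stored = torch.cat(list(self.memory.values()), dim=0)
        idx = torch.randint(0, stored.size(0), (batch_size,))
        return stored[idx]
```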
-
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model
This paper evaluates the compositional reasoning ability of vision-language models (VLMs) with the ComPABench benchmark, finds that reinforcement learning (RL) outperforms supervised fine-tuning (SFT) on cross-task and out-of-distribution generalization, and proposes an RL-Ground method that substantially improves multimodal compositional reasoning performance.
-
MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models
This paper proposes the MMRL and MMRL++ frameworks, which strengthen few-shot adaptation of vision-language models through a shared representation space and a decoupling strategy, and which use parameter-efficient SRRA and PRC mechanisms to improve generalization and training stability, achieving state-of-the-art performance across multiple datasets.
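A shared representation space of this kind can be pictured as a set of learnable tokens projected into each encoder's token space and prepended to both modalities. The sketch below illustrates that pattern only; the class name `SharedRepresentationSpace`, the linear projections, and all dimensions are assumptions for illustration, not MMRL's published architecture.

```python
# Illustrative sketch: shared learnable tokens projected into the vision
# and text encoders' spaces, prepended to each modality's sequence.
# All names and shapes are assumptions, not MMRL's actual design.
import torch
import torch.nn as nn

class SharedRepresentationSpace(nn.Module):
    def __init__(self, num_tokens: int, shared_dim: int,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, shared_dim) * 0.02)
        self.to_vision = nn.Linear(shared_dim, vision_dim)  # modality projection
        self.to_text = nn.Linear(shared_dim, text_dim)      # modality projection

    def forward(self, vision_seq: torch.Tensor, text_seq: torch.Tensor):
        b = vision_seq.size(0)
        v_tok = self.to_vision(self.tokens).unsqueeze(0).expand(b, -1, -1)
        t_tok = self.to_text(self.tokens).unsqueeze(0).expand(b, -1, -1)
        # Prepend the shared tokens to each modality's token sequence.
        return (torch.cat([v_tok, vision_seq], dim=1),
                torch.cat([t_tok, text_seq], dim=1))
```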
-
VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making
This paper introduces VLM Q-Learning, an offline-to-online reinforcement learning method that fine-tunes Vision-Language Models for interactive decision-making by filtering suboptimal actions with a critic head, achieving significant performance improvements over supervised fine-tuning across multiple multimodal agent tasks.
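The critic-head filtering described above can be illustrated with a small value head that scores candidate actions and masks out low-value ones before fine-tuning. This is a hedged sketch under stated assumptions: the layer sizes, the pooled-state input, the threshold rule, and the helper `filter_actions` are illustrative, not the paper's exact design.

```python
# Illustrative sketch: a value head on top of pooled VLM hidden states
# scores (state, action) pairs; low-scoring pairs are filtered from the
# fine-tuning batch. Sizes and threshold rule are assumptions.
import torch
import torch.nn as nn

class CriticHead(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, pooled_states: torch.Tensor) -> torch.Tensor:
        # pooled_states: (batch, hidden_dim) representation of an
        # observation plus a candidate action.
        return self.value(pooled_states).squeeze(-1)

def filter_actions(critic: CriticHead,
                   pooled_states: torch.Tensor,
                   threshold: float = 0.0) -> torch.Tensor:
    """Return a boolean mask keeping pairs the critic scores above threshold."""
    with torch.no_grad():
        q_values = critic(pooled_states)
    return q_values > threshold
```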
-
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
This paper introduces MM-Escape, a benchmark built on the customizable 3D environment EscapeCraft that evaluates multimodal reasoning in MLLMs through room-escape tasks. While models like GPT-4o achieve high success rates in simple scenarios, performance drops sharply as difficulty increases, exposing distinct limitations in reasoning and spatial awareness.