Tag: Robustness
All the articles with the tag "Robustness".
-
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
ASTRA introduces an efficient defense for Vision Language Models that adaptively steers activations away from adversarial directions identified via image attribution, achieving state-of-the-art jailbreak mitigation while largely preserving benign utility and keeping inference efficient.
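The core mechanism, steering a hidden state away from a direction associated with adversarial inputs, can be sketched in a few lines. This is a minimal illustration under assumed shapes and placeholder data, not ASTRA's actual implementation; the mean-difference direction estimate and the `alpha` strength are assumptions:

```python
import numpy as np

def steer_away(h: np.ndarray, harm_dir: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Subtract alpha times the component of activation h along harm_dir."""
    unit = harm_dir / np.linalg.norm(harm_dir)
    return h - alpha * np.dot(h, unit) * unit

# Illustrative "harmful direction": mean activation difference between
# adversarial and benign inputs (random placeholders stand in for real data).
rng = np.random.default_rng(0)
adv, benign = rng.normal(size=(32, 512)), rng.normal(size=(32, 512))
harm_dir = adv.mean(axis=0) - benign.mean(axis=0)

h = rng.normal(size=512)                       # one hidden state at inference
h_steered = steer_away(h, harm_dir, alpha=0.8)

unit = harm_dir / np.linalg.norm(harm_dir)
print(abs(h_steered @ unit) < abs(h @ unit))   # True: harmful component shrank
```

Subtracting only a fraction of the projection (alpha < 1) is one way such a defense could trade attack mitigation against loss of benign utility.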
-
How much do language models memorize?
This paper proposes an information-theoretic method for quantifying memorization that separates unintended memorization from generalization, measures the capacity of GPT-style language models at roughly 3.6 bits per parameter, and shows how the ratio of dataset size to model capacity affects double descent and membership-inference performance.
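The headline figure lends itself to back-of-the-envelope arithmetic: at roughly 3.6 bits per parameter, total memorized content scales linearly with model size. The snippet below is illustrative arithmetic only, not the paper's measurement procedure:

```python
# Capacity estimate implied by the ~3.6 bits-per-parameter figure.
BITS_PER_PARAM = 3.6

def capacity_bits(n_params: float) -> float:
    return BITS_PER_PARAM * n_params

for n in (125e6, 1.5e9, 7e9):
    mb = capacity_bits(n) / 8 / 1e6  # bits -> megabytes
    print(f"{n/1e9:.3g}B params -> ~{mb:.0f} MB of memorized content")
```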
-
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
Through probing and activation-editing experiments, this paper systematically studies the emergence, structure, robustness, and enhanceability of internal belief representations in language models, finding that these representations improve with model scale and fine-tuning, are structured yet brittle to prompt variations, and can be enhanced via Contrastive Activation Addition (CAA) to significantly boost Theory-of-Mind (ToM) performance.
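Both experimental tools named in the summary are simple to sketch. Below, a linear probe tests whether a belief label is linearly decodable from hidden states, and a CAA-style edit adds a contrastive mean-difference vector to the activations; the shapes, random data, and the 0.5 steering strength are all placeholder assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden = rng.normal(size=(1000, 768))    # layer activations, one row per example
belief = rng.integers(0, 2, size=1000)   # e.g., "protagonist believes X": 0/1

# Probing: fit a linear readout and measure held-out accuracy.
probe = LogisticRegression(max_iter=1000).fit(hidden[:800], belief[:800])
print("probe accuracy:", probe.score(hidden[800:], belief[800:]))

# CAA-style activation edit: add the mean difference between the two
# contrastive classes to every activation (0.5 is an arbitrary strength).
caa_vec = hidden[belief == 1].mean(axis=0) - hidden[belief == 0].mean(axis=0)
edited = hidden + 0.5 * caa_vec
```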
-
The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
This paper studies the robustness of large language models to reward noise during reinforcement-learning post-training, proposes the Reasoning Pattern Reward (RPR) strategy, which significantly improves performance by rewarding key reasoning phrases rather than answer correctness, and uses RPR to calibrate noisy reward models, improving results on open-ended tasks.
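A reasoning-pattern reward can be sketched as a scorer that checks for reasoning phrases instead of answer correctness. The phrase list and scoring rule below are illustrative assumptions, not the RPR definition from the paper:

```python
# Score a response by the fraction of reasoning phrases it contains,
# ignoring whether the final answer is correct.
REASONING_PHRASES = ("first,", "let me verify", "therefore", "on the other hand")

def pattern_reward(response: str) -> float:
    text = response.lower()
    return sum(p in text for p in REASONING_PHRASES) / len(REASONING_PHRASES)

print(pattern_reward("First, note x > 0. Therefore the sum is positive."))  # 0.5
```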
-
Task-Core Memory Management and Consolidation for Long-term Continual Learning
This paper introduces Long-CL, a human-memory-inspired framework for long-term continual learning that combines task-core memory management with selective sample consolidation, outperforming baselines by 7.4% and 6.5% AP on two new benchmarks, MMLongCL-Bench and TextLongCL-Bench, while mitigating catastrophic forgetting.
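As a toy illustration of selective consolidation, the sketch below keeps only the top-k highest-importance samples in a replay buffer; the importance scores and the keep-top-k policy are assumptions for illustration, not Long-CL's actual mechanism:

```python
import heapq

def consolidate(samples, importance, k=2):
    """Keep the k samples with the highest importance scores."""
    top = heapq.nlargest(k, zip(importance, range(len(samples))))
    return [samples[i] for _, i in top]

buffer = consolidate(["a", "b", "c", "d"], [0.9, 0.1, 0.7, 0.3], k=2)
print(buffer)  # ['a', 'c']
```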