Tag: Safety

All the articles with the tag "Safety".

Enhancing Safety Standards in Automated Systems Using Dynamic Bayesian Networks

Published: 8 May, 2025 at 11:06 AM

90.63 🤔

This paper proposes a Dynamic Bayesian Network framework for autonomous vehicles that enhances safety in cut-in maneuvers by integrating lateral evidence and probabilistic safety assessments, achieving superior crash avoidance in high-speed scenarios (9.22% crash rate) compared to baseline models in the JRC-FSM simulator.
MELON: Provable Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison

Published: 8 May, 2025 at 10:22 AM

89.40 🤔

MELON introduces a novel training-free defense against indirect prompt injection attacks on LLM agents by detecting independence of tool calls from user inputs through masked re-execution, achieving superior attack prevention (0.24% ASR on GPT-4o) and utility preservation (58.78% UA on GPT-4o) compared to existing methods.
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

Published: 8 May, 2025 at 11:07 AM

87.75 🤔

ASTRA introduces an efficient defense for Vision Language Models by adaptively steering activations away from adversarial directions using image attribution, achieving state-of-the-art performance in mitigating jailbreak attacks with minimal impact on benign utility and high inference efficiency.
Activation Space Interventions Can Be Transferred Between Large Language Models

Published: 8 May, 2025 at 06:19 PM

85.65 🤔

This paper demonstrates that activation space interventions for AI safety, such as backdoor removal and refusal behavior, can be transferred between large language models using autoencoder mappings, enabling smaller models to align larger ones, though challenges remain in cross-architecture transfers and complex tasks like corrupted capabilities.
HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models

Published: 4 May, 2025 at 04:27 PM

78.97 🤔

本文提出Head-Specific Intervention (HSI)方法，通过针对特定注意力头的激活干预，成功诱导Llama 2模型在AI协调行为上绕过安全对齐，效果优于监督微调和其它干预策略。

Tag: Safety

Enhancing Safety Standards in Automated Systems Using Dynamic Bayesian Networks

MELON: Provable Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison

Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

Activation Space Interventions Can Be Transferred Between Large Language Models

HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models