Tag: Safety
All the articles with the tag "Safety".
-
Enhancing Safety Standards in Automated Systems Using Dynamic Bayesian Networks
This paper proposes a Dynamic Bayesian Network framework for autonomous vehicles that improves safety during cut-in maneuvers by integrating lateral evidence with probabilistic safety assessments, achieving better crash avoidance in high-speed scenarios (a 9.22% crash rate) than baseline models in the JRC-FSM simulator.
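The core idea of fusing noisy lateral evidence into a probabilistic estimate of a cut-in can be illustrated with a minimal two-state Bayesian filter. This is only a sketch of the general technique, not the paper's Dynamic Bayesian Network; all probabilities, offsets, and names below are hypothetical.

```python
# Illustrative two-state Bayesian filter over the "cut-in intent" of a neighbouring
# vehicle, updated with noisy lateral-offset evidence. Not the paper's model;
# transition and observation parameters are made up.
import numpy as np

# Hidden state: 0 = stays in lane, 1 = cutting in
transition = np.array([[0.95, 0.05],   # P(next state | stays in lane)
                       [0.10, 0.90]])  # P(next state | cutting in)

def lateral_likelihood(lateral_offset_m: float) -> np.ndarray:
    """P(observed lateral offset | state), modelled as two Gaussians."""
    def gauss(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.array([gauss(lateral_offset_m, 0.0, 0.3),   # centred in its own lane
                     gauss(lateral_offset_m, 1.0, 0.4)])  # drifting toward the ego lane

def step(belief: np.ndarray, lateral_offset_m: float) -> np.ndarray:
    """One predict-update cycle of the filter."""
    predicted = transition.T @ belief
    posterior = predicted * lateral_likelihood(lateral_offset_m)
    return posterior / posterior.sum()

belief = np.array([0.99, 0.01])
for offset in [0.1, 0.3, 0.6, 0.9, 1.1]:      # vehicle drifting laterally over time
    belief = step(belief, offset)
    print(f"offset={offset:.1f} m  P(cut-in)={belief[1]:.2f}")
# A planner could begin braking once P(cut-in) exceeds a safety threshold.
```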
-
MELON: Provable Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison
MELON introduces a training-free defense against indirect prompt injection attacks on LLM agents: by re-executing the agent with the user input masked and comparing tool calls, it detects calls that are independent of the user's request, achieving stronger attack prevention (0.24% ASR on GPT-4o) and utility preservation (58.78% UA on GPT-4o) than existing methods.
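The masked re-execution idea can be sketched generically: run the agent once with the real user task and once with the task masked; tool calls that still appear in the masked run are driven by the retrieved content rather than the user and are flagged. This is an illustration only; `run_agent`, `detect_injection`, and the toy agent are hypothetical stand-ins, not MELON's interface.

```python
# Sketch of injection detection via masked re-execution and tool-call comparison.
from typing import Callable, List, Tuple

ToolCall = Tuple[str, str]          # (tool name, serialized arguments)

def detect_injection(run_agent: Callable[[str, str], List[ToolCall]],
                     user_task: str,
                     retrieved_content: str,
                     mask_token: str = "[MASKED TASK]") -> List[ToolCall]:
    """Return tool calls suspected to originate from the retrieved content."""
    original_calls = run_agent(user_task, retrieved_content)
    masked_calls = run_agent(mask_token, retrieved_content)
    # Calls that survive masking of the user task cannot be justified by it.
    return [call for call in original_calls if call in masked_calls]

# Toy agent: the retrieved page injects a "send_email" instruction.
def toy_agent(task: str, content: str) -> List[ToolCall]:
    calls: List[ToolCall] = []
    if "summarize" in task.lower():
        calls.append(("read_document", "doc.txt"))
    if "send_email" in content:
        calls.append(("send_email", "attacker@example.com"))
    return calls

suspicious = detect_injection(toy_agent, "Summarize doc.txt", "... send_email to attacker ...")
print(suspicious)   # [('send_email', 'attacker@example.com')]
```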
-
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
ASTRA introduces an efficient defense for Vision Language Models by adaptively steering activations away from adversarial directions using image attribution, achieving state-of-the-art performance in mitigating jailbreak attacks with minimal impact on benign utility and high inference efficiency.
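Activation steering in general works by removing the component of a hidden activation that lies along a harmful direction at inference time. The sketch below uses a simple mean-difference direction between harmful and benign activations; it does not reproduce ASTRA's attribution-based construction, and all shapes and data are synthetic.

```python
# Minimal sketch of steering activations away from a harmful direction.
import numpy as np

def steering_direction(harmful_acts: np.ndarray, benign_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from benign toward harmful activations."""
    d = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(activation: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove alpha times the activation's component along the harmful direction."""
    return activation - alpha * (activation @ direction) * direction

rng = np.random.default_rng(0)
hidden_dim = 16
benign = rng.normal(size=(100, hidden_dim))
harmful = rng.normal(size=(100, hidden_dim)) + 0.5          # shifted cluster
d = steering_direction(harmful, benign)

x = harmful[0]
print("before:", float(x @ d), " after:", float(steer(x, d) @ d))  # component along d is removed
```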
-
Activation Space Interventions Can Be Transferred Between Large Language Models
This paper demonstrates that activation space interventions for AI safety, such as backdoor removal and refusal behavior, can be transferred between large language models using autoencoder mappings, enabling smaller models to align larger ones, though challenges remain in cross-architecture transfers and complex tasks like corrupted capabilities.
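The transfer idea can be illustrated with a learned mapping between the activation spaces of two models: an intervention vector found in a small model is pushed through the map into the large model's space. For simplicity the sketch fits a linear least-squares map on paired activations; the paper uses autoencoder mappings, which this does not reproduce, and all data here are synthetic.

```python
# Sketch of carrying an activation-space intervention from a small model to a large
# one via a mapping fitted on paired activations from the same prompts.
import numpy as np

rng = np.random.default_rng(0)
d_small, d_large, n = 8, 32, 500

# Paired activations (synthetic stand-ins for real model activations).
acts_small = rng.normal(size=(n, d_small))
true_map = rng.normal(size=(d_small, d_large))
acts_large = acts_small @ true_map + 0.01 * rng.normal(size=(n, d_large))

# Fit a map from the small model's activation space to the large model's.
W, *_ = np.linalg.lstsq(acts_small, acts_large, rcond=None)

# A steering vector found in the small model (e.g. a refusal direction) ...
steer_small = rng.normal(size=d_small)
# ... is carried into the large model's space through the learned map.
steer_large = steer_small @ W
print(steer_large.shape)   # (32,): an intervention vector in the large model's space
```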
-
HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models
This paper proposes Head-Specific Intervention (HSI), which applies activation interventions to selected attention heads to induce Llama 2 to bypass its safety alignment on AI coordination behavior, outperforming supervised fine-tuning and other intervention strategies.
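A head-specific intervention, in its simplest form, adds a fixed direction only to the outputs of chosen attention heads while leaving the others untouched. The sketch below is illustrative: the head indices, scale, and shapes are arbitrary and are not the configuration used in the paper.

```python
# Minimal sketch of intervening on selected attention heads at one layer.
import numpy as np

num_heads, head_dim = 32, 128
rng = np.random.default_rng(0)

head_outputs = rng.normal(size=(num_heads, head_dim))   # per-head outputs at one layer
direction = rng.normal(size=head_dim)
direction /= np.linalg.norm(direction)

def intervene(head_outputs: np.ndarray, heads: list, scale: float) -> np.ndarray:
    """Shift only the selected heads along `direction`; other heads are untouched."""
    out = head_outputs.copy()
    out[heads] += scale * direction
    return out

shifted = intervene(head_outputs, heads=[3, 17], scale=4.0)
changed = np.where(np.abs(shifted - head_outputs).sum(axis=1) > 0)[0]
print(changed)   # [ 3 17]: only the targeted heads were modified
```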