This survey paper provides a comprehensive overview of adversarial attacks on multimodal AI systems across the text, image, video, and audio modalities, categorizing threats by the attacker's knowledge, intention, and execution method to equip practitioners with an understanding of vulnerabilities and cross-modal risks.
Multimodal Systems, Robustness, Safety, AI Ethics, Human-AI Interaction
Shashank Kapoor, Sanjay Surendranath Girija, Lakshit Arora, Dipen Pradhan, Ankit Shetgaonkar, Aman Raj
Generated by grok-3
Background Problem
The rapid advancement and democratization of multimodal AI models capable of processing text, image, video, and audio have exposed significant vulnerabilities to adversarial attacks both within and across these modalities. While extensive research on adversarial threats exists, practitioner-focused resources that consolidate and simplify this complex threat landscape in the multimodal context are lacking. This survey aims to close that gap by providing a comprehensive overview of adversarial attack types and their evolution, equipping machine learning practitioners with the knowledge to recognize and mitigate the risks of deploying open-source multimodal models.
Method
This paper does not propose a novel method; it surveys existing adversarial attack strategies targeting multimodal AI systems. It establishes a taxonomy based on the attacker's knowledge (white-box vs. black-box), intention (targeted vs. untargeted), and execution method (optimization-based, data poisoning/backdoor, membership inference, model inversion, and cross-modal attacks). The survey details attack mechanisms across four modalities (text, image, video, and audio), citing specific techniques such as gradient-based token perturbations for text (e.g., HotFlip), norm-bounded pixel perturbations for images, and heuristic approaches for video. It also covers cross-modal attacks that exploit joint embeddings to manipulate outputs across modalities. The main steps are categorizing prior research, summarizing how each attack is executed, and presenting a matrix of cross-modal influences to highlight interconnected risks.
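To make the norm-bounded and joint-embedding attack ideas concrete, below is a minimal sketch (not from the survey) of a targeted, white-box, PGD-style attack in PyTorch: an L_inf-bounded pixel perturbation is optimized so that the image's embedding moves toward an attacker-chosen caption's embedding in a shared image-text space. The tiny encoders, the 64-dimensional embedding space, and the hyperparameters (epsilon, alpha, steps) are illustrative stand-ins; in practice the attack would target a pretrained multimodal model's image and text towers.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in encoders mapping both modalities into a shared 64-d space;
# a real attack would use a pretrained multimodal model's image and text encoders.
image_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
text_embedding = torch.nn.Embedding(1000, 64)  # toy vocabulary of 1,000 token ids
image_encoder.requires_grad_(False)            # freeze model weights; only the input is attacked
text_embedding.requires_grad_(False)

def encode_image(x):
    return F.normalize(image_encoder(x), dim=-1)

def encode_text(token_ids):
    return F.normalize(text_embedding(token_ids).mean(dim=1), dim=-1)

# Clean image and an attacker-chosen target caption (toy token ids).
clean_image = torch.rand(1, 3, 32, 32)
target_tokens = torch.randint(0, 1000, (1, 8))
target_emb = encode_text(target_tokens)

# PGD under an L_inf budget: small pixel changes that pull the image embedding
# toward the target text embedding (a targeted, cross-modal attack).
epsilon, alpha, steps = 8 / 255, 2 / 255, 20
delta = torch.zeros_like(clean_image, requires_grad=True)

for _ in range(steps):
    adv_emb = encode_image(clean_image + delta)
    loss = -F.cosine_similarity(adv_emb, target_emb).mean()  # maximize similarity
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()                   # signed-gradient step
        delta.clamp_(-epsilon, epsilon)                      # project into the L_inf ball
        delta.copy_((clean_image + delta).clamp(0, 1) - clean_image)  # keep pixels valid
    delta.grad.zero_()

adv_image = (clean_image + delta).clamp(0, 1).detach()
print("cosine(adv image, target text):",
      F.cosine_similarity(encode_image(adv_image), target_emb).item())
```

Because downstream components (captioning, retrieval, safety filters) typically consume the joint embedding rather than the raw pixels, a perturbation that succeeds in this shared space is precisely the kind of cross-modal risk the survey's influence matrix highlights.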
Experiment
As a survey paper, this work includes no original experiments or datasets; it compiles and summarizes results from the studies it cites on adversarial attacks across modalities. The paper references specific attack success rates and methodologies (e.g., the high attack success rates of GCG on text, the imperceptible audio perturbations of Qin et al., and cross-modal attacks like SneakyPrompt that bypass safety alignment) but does not critically evaluate the experimental setups or the reproducibility of these results. The organization by attack type and modality is logical and gives practitioners a clear structure, though the lack of analysis of how comprehensive or practically applicable the cited experiments are limits the survey's depth. The goal of informing practitioners about threats is partially met through breadth, but actionable insights are missing because critical assessment and defense strategies are absent.
Further Thoughts
While this survey effectively maps the landscape of adversarial attacks on multimodal systems, its practical utility for practitioners is limited without deeper critical analysis or defense recommendations. An intriguing connection could be drawn to AI alignment and safety, where adversarial attacks often exploit misalignments in model objectives: could integrating alignment techniques such as RLHF (Reinforcement Learning from Human Feedback) mitigate some cross-modal vulnerabilities? Additionally, the paper's mention of fragmented defense literature suggests a need for standardized benchmarks or tools, akin to ImageNet for vision tasks, to evaluate multimodal robustness systematically. Exploring how adversarial attacks in multimodal systems relate to emergent abilities in large foundation models could also be insightful, as unexpected behaviors might amplify vulnerabilities. Finally, the limited focus on video-specific attacks hints at an underexplored area: could temporal dynamics in video data introduce unique attack vectors not yet considered in image-based research?