MELON introduces a training-free defense against indirect prompt injection attacks on LLM agents: by re-executing the agent with masked user input, it detects tool calls that are independent of the user's request, achieving stronger attack prevention (0.24% ASR on GPT-4o) and better utility preservation (58.78% UA on GPT-4o) than existing methods.
Large Language Model, Agent, Safety, Robustness, Human-AI Interaction
Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, William Yang Wang
University of California, Santa Barbara, William & Mary
Generated by grok-3
Background Problem
The paper addresses the critical security concern of indirect prompt injection (IPI) attacks on Large Language Model (LLM) agents, where malicious tasks embedded in tool-retrieved information (e.g., from databases or websites) can redirect agents to perform unauthorized actions. Existing defenses either require substantial resources for model retraining, fail against sophisticated attacks, or compromise the agent's utility. The core problem is to design a lightweight, training-free defense that effectively prevents IPI attacks while maintaining high utility, balancing security and functionality in LLM agent systems.
Method
MELON (Masked re-Execution and tooL comparisON) is a training-free IPI defense built on the observation that, under a successful attack, an agent's actions depend less on the user input and more on the malicious task embedded in retrieved data. It re-executes the agent's trajectory in a masked state in which the user input is replaced by a task-neutral prompt (T^f) while tool outputs are preserved, then compares the tool calls of the original and masked executions using embedding-based similarity: if the two runs issue similar tool calls, those calls are independent of the user input, which signals an attack. Three key designs address potential false positives and negatives: a customized masking function that discourages arbitrary tool calls, a tool call cache that handles timing mismatches between the two runs, and a focused tool call comparison that reduces noise.
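As a rough illustration of this comparison step, the sketch below collapses the per-step re-execution into a single call; `run_agent`, `embed`, the `MASK_PROMPT` text, and the 0.8 threshold are all assumed placeholders rather than the paper's actual implementation.

```python
# Minimal sketch of MELON-style detection (hypothetical names and signatures).
# `run_agent` and `embed` are assumed interfaces, not the paper's code.
from typing import Callable, Dict, List
import numpy as np

# Stand-in for the task-neutral prompt T^f used in the masked re-execution.
MASK_PROMPT = "Summarize the content of the tool outputs."

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def detect_injection(
    user_input: str,
    tool_outputs: List[str],
    run_agent: Callable[[str, List[str]], List[Dict]],  # returns a list of tool calls
    embed: Callable[[str], np.ndarray],
    threshold: float = 0.8,  # assumed value, not from the paper
) -> bool:
    """Flag an attack if the masked run reproduces tool calls from the original run."""
    original_calls = run_agent(user_input, tool_outputs)   # original execution
    masked_calls = run_agent(MASK_PROMPT, tool_outputs)    # masked re-execution: user task removed

    # Cache embeddings of masked tool calls so timing mismatches across steps are tolerated.
    masked_cache = [embed(f"{c['name']}({c['args']})") for c in masked_calls]

    for call in original_calls:
        vec = embed(f"{call['name']}({call['args']})")
        if any(cosine(vec, m) >= threshold for m in masked_cache):
            return True  # tool call independent of user input -> likely injected task
    return False
```

In such a sketch the threshold directly trades off false positives against false negatives, which is the tension the paper's focused comparison and masking designs are meant to manage.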
Experiment
Experiments were conducted on the AgentDojo benchmark, which covers four agent types (banking, slack, travel, workspace) with 629 attack cases, using three LLMs: GPT-4o, o3-mini, and Llama-3.3-70B. Four representative IPI attacks (Direct, Ignore Previous Instructions, System Message, Important Messages) were evaluated against five baseline defenses plus MELON and MELON-Aug, with Utility under Attack (UA), Attack Success Rate (ASR), and Benign Utility (BU) as metrics. MELON achieved the lowest ASR (0.24% on GPT-4o) while maintaining high UA (58.78% on GPT-4o), outperforming baselines that often trade utility for security; MELON-Aug improved performance further (0.32% ASR, 68.72% UA on GPT-4o). The setup is comprehensive across attacks and models, though it is limited to AgentDojo and excludes multimodal benchmarks due to their low attack success rates. Ablation studies validated the key designs, with ASR increasing when any of them is removed, and sensitivity tests confirmed robustness to parameter variations. However, failure analysis revealed limitations in detecting response-based attacks (72.73% of missed cases), indicating that the method's focus on tool calls misses some attack vectors. Overall, results matched expectations for tool-call-based attacks but highlighted gaps in broader attack coverage.
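For concreteness, the three metrics can be read as simple fractions over per-case outcomes; the sketch below uses illustrative field names rather than AgentDojo's actual schema.

```python
# Sketch of the three reported metrics, assuming per-case boolean outcomes
# (field names are illustrative, not the benchmark's real interface).
from dataclasses import dataclass
from typing import List

@dataclass
class AttackCase:
    user_task_done: bool       # did the agent still complete the user task under attack?
    malicious_task_done: bool  # did the injected malicious task succeed?

def utility_under_attack(cases: List[AttackCase]) -> float:
    return sum(c.user_task_done for c in cases) / len(cases)       # UA

def attack_success_rate(cases: List[AttackCase]) -> float:
    return sum(c.malicious_task_done for c in cases) / len(cases)  # ASR

def benign_utility(done_flags: List[bool]) -> float:
    return sum(done_flags) / len(done_flags)                        # BU over attack-free runs
```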
Further Thoughts
The concept of leveraging behavioral patterns, such as the independence of tool calls from user inputs in MELON, opens up intriguing possibilities for broader application in AI security. It could potentially be adapted to detect other forms of manipulation beyond IPI, such as subtle biases or misinformation injected through external data sources in multimodal systems, by extending masked re-execution to other output modalities such as text responses. This ties into recent research on adversarial robustness in vision-language models, where similar discrepancies between expected and actual outputs under manipulated inputs are observed. However, the computational overhead of MELON (roughly doubling API calls) raises scalability concerns, especially for real-time applications. Future work could explore integrating MELON with efficient caching mechanisms or lightweight anomaly detection models to reduce costs, drawing inspiration from federated learning, where resource constraints are paramount; one hypothetical caching shape is sketched below. Additionally, the high false positive rate, even if justified on security grounds, might deter adoption in user-centric applications; a hybrid approach that combines MELON with user feedback loops could refine detection thresholds dynamically, enhancing usability while maintaining security.
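As one hypothetical way such caching could look, the sketch below memoizes masked re-executions keyed on the tool outputs, so repeated tool-output states do not trigger extra LLM calls; the class name, keying scheme, and interface are assumptions, not part of MELON.

```python
# Hypothetical memoization of masked re-executions to reduce the extra API calls.
import hashlib
from typing import Callable, Dict, List

class MaskedRunCache:
    def __init__(self, run_masked: Callable[[List[str]], List[dict]]):
        self.run_masked = run_masked          # masked re-execution (one extra LLM call)
        self._cache: Dict[str, List[dict]] = {}

    def _key(self, tool_outputs: List[str]) -> str:
        # Key on the exact tool outputs; identical states reuse the stored masked run.
        return hashlib.sha256("\x1f".join(tool_outputs).encode()).hexdigest()

    def get(self, tool_outputs: List[str]) -> List[dict]:
        key = self._key(tool_outputs)
        if key not in self._cache:
            self._cache[key] = self.run_masked(tool_outputs)
        return self._cache[key]
```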