arXiv:2505.02922

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference


RetroInfer reimagines the KV cache as a vector storage system, using an attention-aware wave index and wave buffer to achieve up to 4.5x speedup over full attention and 10.5x over sparse baselines for long-context LLM inference, while preserving near-full-attention accuracy.

Large Language Model, Long Context, Efficiency, Transformer, Representation Learning

Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang

Microsoft Research, University of Science and Technology of China, Wuhan University, Tsinghua University, Shanghai Jiao Tong University

Generated by grok-3

Background Problem

The rapid expansion of context windows in large language models (LLMs), driven by applications such as multi-turn conversation and large-scale data analysis, has made efficient inference challenging. The core issue is that the key-value (KV) cache grows linearly with sequence length, and the resulting GPU memory and bandwidth pressure limits batch sizes and throughput, especially for contexts beyond 128K tokens. Existing remedies fall short: offloading the KV cache to CPU memory is bottlenecked by PCIe bandwidth, while sparsity-based methods struggle to identify important tokens dynamically and to coordinate work across GPU and CPU. RetroInfer addresses these problems by reconceptualizing the KV cache as a vector storage system that exploits attention sparsity, enabling scalable long-context inference without sacrificing accuracy.
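To make the memory pressure concrete, here is a back-of-envelope sketch (not from the paper) of KV cache size for a Llama-3.1-8B-style configuration with grouped-query attention; the layer count, KV head count, head dimension, and fp16 precision below are assumptions about a typical 8B-class model, and the batch size is illustrative.

```python
# Back-of-envelope KV cache sizing for a Llama-3.1-8B-like model (GQA).
# Assumed configuration: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes/element).
def kv_cache_bytes(seq_len, batch_size=1, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, stored per layer and per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch_size

for ctx in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"context {ctx:>9,d} tokens -> {gib:6.1f} GiB of KV cache per sequence")

# Under these assumptions, a 128K-token context needs ~16 GiB of KV cache per
# sequence, and a 1M-token context exceeds an 80 GB A100 even at batch size 1,
# which is why offloading or sparsity becomes unavoidable.
```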

Method

RetroInfer introduces a system design that treats the KV cache as a vector storage system, using attention-aware mechanisms to accelerate long-context LLM inference. Its core components are (see the sketch after this list):

- Wave index: an attention-aware index that clusters cached keys into segments and retrieves only the KV vectors most relevant to the current query, exploiting the inherent sparsity of attention.
- Wave buffer: a GPU-side cache that exploits the temporal locality of attention to keep frequently retrieved segments resident on the GPU, coordinating KV placement and transfer between GPU and CPU memory to hide PCIe costs.
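Since the paper's index and buffer are only summarized above, the following is a minimal, hypothetical sketch of the underlying retrieval idea: cluster cached keys into fixed-size segments, score segment centroids against the decode-time query, and attend only over the top-scoring segments. The function names, segment size, and plain dot-product scoring are illustrative assumptions, not the authors' implementation; the real wave index uses its own attention-aware clustering and retrieval.

```python
import numpy as np

def build_segments(keys: np.ndarray, segment_size: int = 32):
    """Group contiguous keys into fixed-size segments and keep one centroid per segment.

    Centroids act as coarse proxies for the tokens inside each segment.
    (Sketch only; the actual wave index has its own clustering scheme.)
    """
    n = keys.shape[0]
    segments = [keys[i:i + segment_size] for i in range(0, n, segment_size)]
    centroids = np.stack([seg.mean(axis=0) for seg in segments])
    return segments, centroids

def retrieve_top_segments(query: np.ndarray, centroids: np.ndarray, budget: float):
    """Score centroids against the query and return the indices of the best segments,
    keeping roughly `budget` fraction of all segments (e.g. 0.018 for a 1.8% budget)."""
    scores = centroids @ query                      # one relevance score per segment
    k = max(1, int(np.ceil(budget * len(centroids))))
    return np.argsort(scores)[::-1][:k]

# Toy usage: 4096 cached keys of dimension 128, one decode-time query.
rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

segments, centroids = build_segments(keys)
top = retrieve_top_segments(query, centroids, budget=0.05)
selected_keys = np.concatenate([segments[i] for i in top])
print(f"attend over {selected_keys.shape[0]} of {keys.shape[0]} cached keys")
```

In the full system, segments selected this way that are not already resident on the GPU would be fetched from CPU memory; the wave buffer's role is to keep the hot segments on the GPU so that such transfers stay off the critical path.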

Experiment

RetroInfer was evaluated on three LLMs (Llama 3.1-8B, Qwen2.5-7B, Llama3-8B-1048K) using long-context benchmarks such as RULER, Needle-in-a-Haystack (NIAH), and LongBench, with context lengths up to 1M tokens and varying batch sizes. Experiments ran on a server with an NVIDIA A100 GPU (80GB) and AMD EPYC CPUs (1900GB of host memory) connected over PCIe 4.0. RetroInfer was compared against full attention and sparse-attention baselines (Quest, MagicPIG, InfiniGen) under a fixed retrieval budget of 1.8%. It achieved near-full-attention accuracy (e.g., only a 0.73-1.46% drop on RULER) while outperforming sparse baselines by up to 17.02% in accuracy. On throughput, it delivered up to 4.5x speedup over full attention within GPU memory limits and 10.5x over sparse methods when offloading to CPU, demonstrating scalability. The experimental design was comprehensive, covering diverse models and tasks, though the fixed segment size and cache size (5% of KV vectors) might not generalize to all scenarios. The results matched expectations for efficiency but leave open questions about robustness under attention-sparsity patterns not fully captured by these benchmarks.
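To see why a small retrieval budget makes CPU offloading tractable, a rough PCIe calculation helps. This is a sketch with assumed numbers, not the paper's measurement methodology: roughly 128 KiB of fp16 KV data per token for an 8B-class model (as in the earlier sizing sketch) and a nominal 32 GB/s for PCIe 4.0 x16.

```python
# Rough PCIe-traffic illustration (assumed figures, not the paper's measurements).
KV_BYTES_PER_TOKEN = 128 * 1024   # ~128 KiB of fp16 KV data per token (8B-class model)
PCIE_BYTES_PER_SEC = 32e9         # nominal PCIe 4.0 x16 bandwidth

def transfer_ms(context_len, retrieval_budget):
    """Time to move the retrieved slice of a context's KV cache over PCIe, in milliseconds."""
    tokens_moved = context_len * retrieval_budget
    return tokens_moved * KV_BYTES_PER_TOKEN / PCIE_BYTES_PER_SEC * 1e3

ctx = 131_072  # 128K-token context
print(f"full KV transfer : {transfer_ms(ctx, 1.0):7.2f} ms per decode step")
print(f"1.8% budget      : {transfer_ms(ctx, 0.018):7.2f} ms per decode step")

# Moving the full 128K-token cache every decode step would take over half a second,
# while a 1.8% budget (further reduced by wave-buffer cache hits) cuts PCIe traffic
# by roughly 50x, which is the headroom that makes GPU-CPU co-execution practical.
```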

Further Thoughts

RetroInfer’s approach of treating the KV cache as a vector storage system opens intriguing avenues for further exploration, particularly in its intersection with other AI domains like multimodal systems where long-context processing could involve diverse data types (e.g., text, image, audio embeddings). Could the wave index’s clustering and retrieval mechanisms adapt to multimodal sparsity patterns, potentially enhancing efficiency in vision-language models? Additionally, the reliance on temporal locality for high cache hit ratios prompts a connection to reinforcement learning environments, where sequential decision-making might exhibit similar locality—could RetroInfer’s wave buffer inspire memory management in RL agents for long-horizon tasks? A potential limitation lies in the static segment size for clustering; integrating adaptive segmentation based on context or task type, perhaps drawing from meta-learning principles, could improve robustness. Finally, while the paper focuses on inference, exploring its applicability to continual learning scenarios, where models update with streaming data, might reveal new challenges in maintaining index efficiency over dynamic contexts. These connections highlight RetroInfer’s potential to influence broader AI system designs beyond LLMs.


