The Mixture-of-Memories (MoM) architecture introduces multiple independent memory states with a routing mechanism to enhance memory capacity and reduce interference in linear sequence modeling. It achieves significant performance gains over other linear models on recall-intensive tasks and approaches Transformer performance at larger scales while retaining linear-time efficiency.
Linear Sequence Modeling, Memory Capacity, Memory Interference, Routing Mechanism, Long-Term Dependencies, Efficiency
Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng
Shanghai AI Laboratory, Nanjing University, South China University of Technology, The Hong Kong University of Science and Technology (Guangzhou), The Chinese University of Hong Kong
Generated by grok-3
Background Problem
Linear sequence modeling methods, such as linear attention and state space models, address the quadratic complexity issue of Transformers (O(n^2)) by achieving linear training complexity (O(n)) and constant inference complexity (O(1)). However, their compression of entire sequences into a single fixed-size memory state results in limited memory capacity and memory interference, degrading performance on recall-intensive tasks where long-term context retention is crucial. Inspired by neuroscience mechanisms like theta-gamma oscillations in the hippocampus, which separate memory items to prevent interference, this work introduces Mixture-of-Memories (MoM) to enhance memory capacity and reduce interference while retaining efficiency benefits.
Method
The Mixture-of-Memories (MoM) architecture maintains multiple independent memory states to store diverse sequence information, mitigating the interference seen in single-memory linear models. A router network assigns each input token to specific memory states using a top-k scoring mechanism (softmax over a linear layer's output), so only a subset of memories (e.g., 2 out of 4) is activated and updated per token. Each activated memory undergoes a linear recurrent update: the input is projected into key-value pairs and the memory state is updated (e.g., M_t^m = M_{t-1}^m + (k_t^m)^T v_t^m), with optional gating mechanisms such as forget gates for finer control. A shared memory state, updated with every token, captures long-term dependencies. The output is computed by querying the weighted sum of activated memory states (the mixed memory) with a token-specific query vector, followed by normalization and activation. This design maintains linear complexity during training and constant complexity during inference, and admits hardware-efficient implementations such as chunkwise parallelism.
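To make the routing and per-token memory update concrete, below is a minimal PyTorch sketch of a MoM-style layer following the description above. It is an illustrative reconstruction rather than the authors' implementation: the class and parameter names are hypothetical, the recurrence uses only the plain additive update M_t^m = M_{t-1}^m + (k_t^m)^T v_t^m without the optional forget gates, and the sequential Python loop stands in for the hardware-efficient chunkwise-parallel kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoMLayer(nn.Module):
    """Illustrative Mixture-of-Memories layer: a router picks the top-k of n_mem
    matrix-valued memories per token, only those are updated, and the output is
    the query applied to their weighted mix plus a shared memory."""

    def __init__(self, d_model: int, n_mem: int = 4, top_k: int = 2):
        super().__init__()
        self.n_mem, self.top_k = n_mem, top_k
        self.router = nn.Linear(d_model, n_mem)   # token-to-memory scores
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)         # output normalization

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        B, T, D = x.shape
        mems = x.new_zeros(B, self.n_mem, D, D)   # routed memory states M^m
        shared = x.new_zeros(B, D, D)             # shared memory, sees every token
        outputs = []
        for t in range(T):
            xt = x[:, t]                                           # (B, D)
            q, k, v = self.q_proj(xt), self.k_proj(xt), self.v_proj(xt)
            scores = self.router(xt)                               # (B, n_mem)
            top_scores, top_idx = scores.topk(self.top_k, dim=-1)
            weights = scores.new_zeros(B, self.n_mem).scatter(
                1, top_idx, F.softmax(top_scores, dim=-1))         # mixing weights, zero off top-k
            mask = scores.new_zeros(B, self.n_mem).scatter(1, top_idx, 1.0)
            kv = torch.einsum('bi,bj->bij', k, v)                  # outer product k_t^T v_t
            mems = mems + mask[..., None, None] * kv[:, None]      # update activated memories only
            shared = shared + kv                                   # shared memory takes every token
            mixed = (weights[..., None, None] * mems).sum(1) + shared  # mixed memory
            outputs.append(torch.einsum('bi,bij->bj', q, mixed))   # read with the query vector
        return self.norm(torch.stack(outputs, dim=1))              # (B, T, D); activation/out-proj omitted
```

As a quick usage check, `MoMLayer(512)(torch.randn(2, 16, 512))` returns a `(2, 16, 512)` tensor. In a full model the memory dimension would likely be a smaller per-head dimension rather than d_model, and the chunkwise-parallel formulation removes the per-token loop.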
Experiment
Experiments evaluate MoM against linear models (e.g., RetNet, HGRN2, GLA, Gated DeltaNet) and Transformer++ at two model scales (340M and 1.3B parameters) on recall-intensive tasks (e.g., SQuAD, TriviaQA), commonsense reasoning tasks (e.g., PIQA, WinoGrande), and long-context benchmarks (e.g., LongBench). Datasets are chosen to test memory capacity and context handling, with training on SlimPajama (15B/100B tokens) under consistent setups (e.g., AdamW optimizer, cosine learning-rate schedule). Results show MoM significantly outperforms other linear models on recall-intensive tasks (e.g., a 28.16 average score at 340M vs. 24.78 for Gated DeltaNet) and approaches Transformer performance at 1.3B (36.04 vs. 37.31). On commonsense reasoning, MoM achieves the best average scores among linear models (41.97 at 340M, 50.97 at 1.3B). Long-context and length-extrapolation tests further confirm MoM's advantage over linear baselines and its better scalability to long inputs than Transformers. A hybrid MoM-Transformer model (3 Transformer layers out of 24 total) exceeds standalone Transformer performance, suggesting the two are complementary. Efficiency tests confirm linear complexity, with lower memory and time costs than Transformers on long sequences. Ablation studies identify an effective memory configuration (4 memories, 2 activated) and underscore the importance of the shared memory. However, the absence of analysis on routing failures or memory specialization limits deeper insight, and the remaining gap with Transformers at the smaller 340M scale leaves open questions about when MoM's benefits fully materialize.
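For quick reference, the reported setup can be condensed into a small configuration sketch. Field names are illustrative, the 15B/100B token budgets are paired with the 340M/1.3B scales as implied above, and hyperparameters not stated in the text (learning rate, batch size, context length) are deliberately omitted.

```python
# Illustrative summary of the evaluation setup described above (not an official config file).
EXPERIMENT_SETUP = {
    "baselines": ["RetNet", "HGRN2", "GLA", "Gated DeltaNet", "Transformer++"],
    "scales_and_tokens": {"340M": "SlimPajama 15B tokens", "1.3B": "SlimPajama 100B tokens"},
    "optimizer": "AdamW with a cosine learning-rate schedule",
    "mom_config": {"num_memories": 4, "num_activated": 2, "shared_memory": True},
    "hybrid_variant": "3 Transformer layers out of 24 total",
    "benchmarks": {
        "recall_intensive": ["SQuAD", "TriviaQA"],
        "commonsense": ["PIQA", "WinoGrande"],
        "long_context": ["LongBench"],
    },
}
```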
Further Thoughts
The MoM architecture's use of multiple memory states opens intriguing avenues for further exploration, particularly in how memory specialization (as hinted in Table 7) could align with domain-specific tasks, for example routing scientific terms to a dedicated memory in AI-for-science applications. This parallels the trend in Mixture-of-Experts (MoE) models, where expert specialization improves performance, and suggests a potential hybrid MoE-MoM framework in which memory states act as domain experts. However, the reliance on a shared memory for long-term dependencies raises the question of whether it undermines the interference-reduction goal: could a more dynamic shared-memory update mechanism, perhaps inspired by attention sparsity in Transformers, further improve performance? Additionally, the success of the hybrid MoM-Transformer model suggests that MoM may be best positioned as a complementary layer in mixed architectures rather than a full replacement for attention, prompting research into optimal layer-interleaving strategies. Finally, connecting MoM to other neuroscience-inspired approaches, such as spiking neural networks, could yield deeper insight into biologically plausible memory mechanisms and may help address the routing inaccuracies and memory-load imbalances not fully explored in this paper.