arXiv: 2505.03320

Recall with Reasoning: Chain-of-Thought Distillation for Mamba's Long-Context Memory and Extrapolation


This paper proposes Recall with Reasoning (RwR), a method that enhances Mamba’s long-context memory and extrapolation by distilling chain-of-thought summarization from a teacher model, achieving significant improvements on the LONGMEMEVAL and HELMET benchmarks while preserving short-context capabilities.

State Space Model, Long Context, Fine-tuning, Reasoning, Representation Learning

Junyu Ma, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu

Tencent AI Lab

Generated by grok-3

Background Problem

Transformer-based Large Language Models (LLMs) excel in various tasks but suffer from quadratic complexity and poor length extrapolation for long contexts. In contrast, Mamba, a state space model (SSM), offers linear complexity and theoretical infinite-context potential; however, it struggles with long-context memory when input sequences exceed training lengths. Existing methods like DeciMamba and ReMamba attempt to address this by compressing or filtering unimportant tokens, but they often compromise input integrity and fail to significantly improve performance at very long contexts. This paper introduces Recall with Reasoning (RwR) to unlock Mamba’s long-context memory and extrapolation ability by distilling chain-of-thought (CoT) summarization from a teacher model, aiming to enhance active recall and reasoning over extended sequences without architectural modifications.

Method

Recall with Reasoning (RwR) is a data-driven approach that enhances Mamba’s long-context memory by distilling chain-of-thought (CoT) summarization from a Transformer-based teacher model. The core idea is to teach Mamba to actively recall and reason over long contexts by prepending query-aware summaries as CoT prompts during fine-tuning. The method has two main components:

1. Summary-based CoT Construction: a Transformer teacher (Llama-3.1-8B-Instruct) extracts query-relevant summaries from long contexts in the OpenOrca dataset, and GPT-4o validates each summary for consistency with the ground-truth answer. The resulting data comprise valid summary examples (context-query-summary triples) and empty summary examples (contexts that do not contain the answer, included to prevent overconfident recall).
2. Segmented Summarization for Answering (SSA): for very long inputs, the context is divided into smaller segments, a summary is generated for each segment, and the concatenated summaries are fed into Mamba to answer the query, keeping processing lengths manageable (see the sketch below).

Mamba is fine-tuned on a combined dataset of OpenOrca and the constructed CoT data, enabling it to identify key information and improve long-context recall without altering its architecture.
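The paper does not include reference code, but the SSA procedure can be illustrated with a short, hedged sketch. Below, `summarize`, `generate_answer`, and the default segment length are hypothetical placeholders standing in for the fine-tuned Mamba’s query-aware summarization and answer generation; only the control flow (segment, summarize each piece, answer from the concatenated summaries) reflects the description above.

```python
from typing import Callable, List


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Divide a long token sequence into consecutive fixed-size segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def ssa_answer(
    context_tokens: List[str],
    query: str,
    summarize: Callable[[str, str], str],   # hypothetical: query-aware summary of one segment
    generate_answer: Callable[[str], str],  # hypothetical: Mamba generation from a prompt
    segment_len: int = 6000,                # assumed segment size; the paper does not fix one
) -> str:
    """Segmented Summarization for Answering (SSA), per the description above."""
    segments = split_into_segments(context_tokens, segment_len)

    # Summarize each segment with respect to the query; empty summaries
    # (segments containing nothing relevant) are dropped.
    summaries = []
    for segment in segments:
        summary = summarize(" ".join(segment), query)
        if summary.strip():
            summaries.append(summary)

    # The concatenated summaries act as a chain-of-thought prefix for answering.
    prompt = "\n".join(summaries) + f"\nQuestion: {query}\nAnswer:"
    return generate_answer(prompt)
```

The design point is that each model call only ever sees one segment (or a stack of short summaries) rather than the full input, which is how SSA keeps processing lengths manageable for contexts far beyond the training length.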

Experiment

The experiments evaluate RwR’s impact on Mamba’s long-context memory and extrapolation using two benchmarks: LONGMEMEVAL (chat-assistant memory tasks at 10k and 100k token lengths) and HELMET (diverse long-context tasks at 16k tokens). Short-context tasks (e.g., RTE, GSM8K) are also tested to check for regressions. The setup uses Mamba-2.8b as the backbone, fine-tuned on 100,000 OpenOrca samples plus 10,000 constructed CoT summary examples, truncated to 6,000 tokens due to memory constraints; a minimal data-preparation sketch is given below. Baselines include untuned Mamba, fine-tuned Mamba (SFT), DeciMamba, and ReMamba.

RwR significantly outperforms the baselines on long-context tasks, achieving a weighted average of 27.6% on LONGMEMEVAL (10k) versus ReMamba’s 24.4%, and 9.8% at 100k versus ReMamba’s 4.0%, rising to 11.4% with SSA. On HELMET, RwR scores 34.3% against ReMamba’s 31.8%. On short-context tasks, RwR slightly improves performance (e.g., 93.0% on Reasoning versus Mamba SFT’s 88.5%), whereas the baselines decline.

The experimental design is reasonable for demonstrating long-context improvement, though it is limited by not testing beyond 100k tokens or on other SSMs, and the reliance on powerful teacher models for summary generation raises scalability concerns. The results support the hypothesis that CoT summarization enhances long-context memory, while the marginal short-context gains suggest the method’s primary value lies in extended sequences.
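As a rough illustration of the fine-tuning data preparation described above (100,000 OpenOrca samples plus 10,000 CoT summary examples, truncated to 6,000 tokens), here is a minimal sketch. The loaders and the tokenizer wiring are hypothetical placeholders, not the authors’ code; any Hugging Face-style tokenizer whose call returns `input_ids` would fit the interface assumed here.

```python
import random
from typing import Callable, Dict, List

MAX_TOKENS = 6000  # truncation length reported in the experimental setup


def truncate_example(example: Dict[str, str], tokenizer, max_tokens: int = MAX_TOKENS) -> Dict[str, str]:
    """Clip an example's input text to at most max_tokens tokens."""
    input_ids = tokenizer(example["input"])["input_ids"][:max_tokens]
    example["input"] = tokenizer.decode(input_ids)
    return example


def build_training_mix(
    tokenizer,
    load_openorca: Callable[[], List[Dict[str, str]]],          # hypothetical loader
    load_cot_summary_data: Callable[[], List[Dict[str, str]]],  # hypothetical loader
) -> List[Dict[str, str]]:
    """Combine general SFT data with the constructed CoT summary data."""
    openorca = random.sample(load_openorca(), 100_000)   # sample sizes from the reported setup
    cot_data = random.sample(load_cot_summary_data(), 10_000)
    mixed = [truncate_example(ex, tokenizer) for ex in openorca + cot_data]
    random.shuffle(mixed)
    return mixed
```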

Further Thoughts

The RwR approach opens up interesting avenues for enhancing state space models like Mamba, particularly in long-context scenarios, but it also prompts deeper questions about scalability and adaptability. The reliance on a powerful Transformer model like GPT-4o for summary validation could be a bottleneck in resource-constrained environments—could a lighter model or an unsupervised method achieve similar validation quality? Additionally, the segmented summarization strategy (SSA) might risk losing critical cross-segment dependencies in very long contexts; exploring hierarchical summarization or dynamic segment sizing could mitigate this. I’m also curious about the potential of applying RwR to other domains, such as time-series data or graph data, where long-range dependencies are crucial—could CoT distillation be adapted to non-textual contexts to enhance memory in SSMs? Furthermore, connecting this work to recent advancements in parameter-efficient fine-tuning (like Low-Rank Adaptation), one might investigate if combining RwR with such techniques could reduce the computational overhead of fine-tuning on large datasets. Lastly, the paper’s limitation in testing beyond 100k tokens suggests a need for future work on extreme long-context scenarios, potentially aligning with research on scaling laws to understand how performance scales with context length in SSMs versus Transformers.


