arXiv: 2505.03005

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale


RADLADS introduces a cost-effective three-step distillation protocol to convert softmax attention transformers into linear attention models using only 350-700M tokens, achieving near-teacher performance on benchmarks and setting a new state-of-the-art for pure RNNs with models up to 72B parameters.

Transformer, Large Language Model, Pre-training, Fine-tuning, Efficiency

Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah

Recursal AI, EleutherAI, Dalle Molle Institute for Artificial Intelligence USI-SUPSI, George Mason University

Generated by grok-3

Background Problem

The research addresses the high computational and financial cost of training large language models with linear attention from scratch, a significant barrier for most organizations. Traditional softmax attention transformers, while effective, incur O(N) time per generated token and memory that grows with context length due to the Key-Value (KV) cache, which becomes prohibitive for long sequences. Linear attention models promise O(1) time per token and constant memory usage, but training them at scale typically requires trillions of tokens, which is infeasible for most groups. The key problem solved by this work is the efficient conversion of pre-trained softmax attention transformers into high-performing linear attention models using a tiny fraction of the original training data (less than 0.005% of the original token count), thereby democratizing access to efficient large-scale models and enabling rapid experimentation with new architectures.
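
To make the memory argument concrete, the sketch below compares how a softmax-attention KV cache grows with context length against the fixed-size recurrent state of a linear-attention decoder. The layer counts, head counts, and head dimensions are illustrative placeholders, not the actual Qwen2.5 or RWKV configurations.

```python
# Illustrative sketch only: all shapes are assumed, not taken from the paper.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Softmax attention: keys and values are cached per layer for every past token,
    so memory grows linearly with sequence length."""
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

def recurrent_state_bytes(n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    """Linear attention / RNN: each head keeps a fixed-size state matrix,
    independent of how many tokens have been processed."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem

for seq_len in (1_024, 16_384, 131_072):
    print(f"seq_len={seq_len:>7}: KV cache ≈ {kv_cache_bytes(seq_len) / 2**30:.2f} GiB, "
          f"recurrent state ≈ {recurrent_state_bytes() / 2**30:.2f} GiB")
```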

Method

The RADLADS protocol is a three-step distillation process, preceded by a weight-transfer setup, that converts softmax attention transformers into linear attention decoders:

  1. Setup - Attention Weights Transfer: Initialize student model weights by transferring attention-related weights (e.g., query, key, value) from the teacher model to equivalent parameters in the student’s recurrent architecture (RAD-RWKV6 or RAD-RWKV7, variants of RWKV with modifications like Gated Linear Attention and optional RoPE).
  2. Step 1 - Attention Hidden State Alignment: Train each student sequence-mixing layer to approximate the hidden-state outputs of the corresponding teacher attention layer using an L2 loss, with 100M tokens at sequence length 512 and a learning rate cosine-annealed from 1e-3 to 1e-5 (both distillation losses are sketched in code after this list).
  3. Step 2 - Knowledge Distillation: Train the entire student model to match the teacher's logits via a Kullback-Leibler divergence loss, using 250-500M tokens at a flat learning rate of 1e-5.
  4. Step 3 - Context Length Extension: Fine-tune on longer sequences (up to length 16,384) with cross-entropy loss for 100M tokens to enhance long-context capabilities.

Training uses the DCLM dataset throughout. The method also introduces two new architectures, RAD-RWKV6 (‘RADFinch’) and RAD-RWKV7 (‘RADGoose’), optimized for conversion by simplifying the RWKV designs, removing unnecessary components such as tokenshift in RAD-RWKV7, and retaining teacher model structures such as Grouped Query Attention where applicable.
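
As referenced in the list above, the two distillation objectives can be summarized in a few lines of PyTorch. This is a minimal sketch of the losses as described in this summary, not the actual RADLADS training code; the function names, tensor shapes, and reduction choices are assumptions.

```python
import torch
import torch.nn.functional as F

def step1_hidden_state_loss(student_mix_out, teacher_attn_out):
    """Step 1: L2-style loss aligning each student sequence-mixing layer's
    output with the corresponding teacher attention layer's output."""
    return F.mse_loss(student_mix_out, teacher_attn_out)

def step2_logit_distillation_loss(student_logits, teacher_logits):
    """Step 2: KL divergence pushing the student's token distribution
    toward the teacher's."""
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

# Dummy shapes for demonstration only (batch=2, seq=512, hidden=64, vocab=1000).
h_student, h_teacher = torch.randn(2, 512, 64), torch.randn(2, 512, 64)
z_student, z_teacher = torch.randn(2, 512, 1000), torch.randn(2, 512, 1000)
print(step1_hidden_state_loss(h_student, h_teacher))
print(step2_logit_distillation_loss(z_student, z_teacher))
```

In Step 1 the teacher's attention outputs serve as frozen targets and only the student's sequence-mixing layers are trained; in Step 2 the whole student is trained against the teacher's logits.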

Experiment

The experiments converted Qwen2.5 models (7B, 32B, and 72B parameters) into linear attention models using RADLADS and evaluated them on standard benchmarks including LAMBADA, MMLU, ARC, PIQA, Winogrande, and HellaSwag. Training used the DCLM dataset with 350-700M tokens across the three steps, far less than prior methods (e.g., SUPRA at 100B tokens). RADLADS models achieve high accuracy ratios relative to their teachers (e.g., QRWKV7-7B-Instruct at a 0.924 MMLU ratio), often outperforming other conversion methods such as MOHAWK or LOLCats, with QRWKV6-72B setting a new state-of-the-art for pure RNNs.

The experimental design is reasonable for efficiency testing, focusing on token reduction and benchmark performance, but it lacks diversity in teacher models (only Qwen2.5) and real-world long-context tasks, which are critical for demonstrating the benefits of linear attention. Ablation studies reveal only marginal impacts from architectural tweaks (e.g., RoPE or tokenshift), suggesting some claimed innovations might be overstated. While the results match the expectation of maintaining quality with reduced training, the lack of stability analysis at larger scales (noted for RAD-RWKV7) and potential dataset bias (DCLM matching Qwen pre-training data) temper the comprehensiveness of the evaluation.
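
For clarity, the accuracy ratio reported above appears to be simply the student's benchmark score divided by the teacher's score on the same task. A trivial sketch with placeholder numbers, not results from the paper:

```python
def accuracy_ratio(student_score: float, teacher_score: float) -> float:
    """Student benchmark score expressed as a fraction of the teacher's score."""
    return student_score / teacher_score

# Placeholder values only: a ratio of ~0.92 means the converted model retains
# roughly 92% of the teacher's accuracy on that benchmark.
print(round(accuracy_ratio(student_score=0.680, teacher_score=0.735), 3))  # 0.925
```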

Further Thoughts

The RADLADS approach opens up fascinating avenues for reducing the barrier to entry for experimenting with linear attention models, particularly in resource-constrained settings, but its reliance on a single teacher family (Qwen2.5) raises questions about broader applicability. Could this method be adapted to other foundation models with different pre-training distributions, such as Llama or Mistral, without significant performance drops? Additionally, the focus on benchmark performance might overlook nuanced capabilities like reasoning or long-context coherence, areas where linear attention should theoretically excel; future work could integrate datasets or tasks from initiatives like the LongBench suite to test these aspects. Another intriguing connection is to federated learning paradigms: if RADLADS can be applied in distributed settings, it might enable smaller entities to collaboratively convert and deploy efficient models while preserving data privacy. Lastly, the stability issues at larger scales hint at potential architectural limitations; exploring hybrid approaches or integrating insights from recent state-space models like Mamba could offer a path to more robust conversions. These directions warrant deeper investigation to solidify RADLADS as a generalizable framework.


