This paper introduces the Compact Recurrent Transformer (CRT), which combines shallow Transformers with RNNs to process long sequences efficiently using a single persistent memory vector. CRT achieves performance superior or comparable to full-length Transformers and Transformer-XL on language and video tasks at significantly reduced computational cost.
Transformer, RNN, Efficiency, Prediction, Classification, Multimodality
Edison Mucllari, Zachary Daniels, David Zhang, Qiang Ye
University of Kentucky, SRI International
Generated by grok-3
Background Problem
The Transformer architecture, while highly successful in language and video processing tasks, struggles with long sequences because of the quadratic complexity of its self-attention mechanism, which makes it computationally expensive and often infeasible on low Size, Weight, and Power (low-SWaP) devices such as those used in edge computing. Existing solutions, such as Transformer-XL and the Recurrent Memory Transformer, handle long sequences by segmenting them and using memory mechanisms, but they often introduce significant computational overhead of their own. This paper addresses the problem of processing long sequences at reduced computational cost by proposing a Compact Recurrent Transformer (CRT) that combines shallow Transformers for local segment processing with RNNs for global memory management, targeting applications in resource-constrained environments.
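To make the scaling argument concrete, the toy calculation below contrasts full-sequence self-attention with segment-wise attention over short segments plus a single memory token; the sequence and segment lengths are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope attention-cost comparison (illustrative numbers, not from the paper).
L = 2048                          # hypothetical full sequence length
s = 64                            # hypothetical segment length
n = L // s                        # number of segments

full_attention = L ** 2           # O(L^2) pairwise scores for full-sequence self-attention
segmented = n * (s + 1) ** 2      # O(n * (s + 1)^2) when each segment attends to itself plus one memory token

print(f"full-sequence attention : {full_attention:,} score computations")
print(f"segments + memory token : {segmented:,} score computations")
```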
Method
The Compact Recurrent Transformer (CRT) integrates Transformer and Recurrent Neural Network (RNN) architectures to handle long sequences efficiently. Its core idea is to process short local segments with a shallow Transformer while an RNN (a GRU or NCGRU) maintains a single persistent memory vector that summarizes global context across segments. The method works as follows: (1) each input segment is concatenated with a memory token derived from the previous segment's RNN hidden state, allowing the Transformer to attend to both local tokens and historical context; (2) after processing through the Transformer layers, the output embeddings are fed into the RNN to update the memory vector, which is passed to the next segment; (3) a separate RNN provides recurrent positional encoding that aligns memory and input tokens without complex relative positional computations. This hybrid approach leverages the Transformer's strength at local modeling and the RNN's ability to carry sequential information, reducing computational complexity compared to full-length Transformers or other recurrent Transformer variants.
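A minimal sketch of this segment-processing flow is given below, assuming a PyTorch implementation; the layer sizes, the use of nn.TransformerEncoder, the zero-input GRU standing in for the recurrent positional encoding, and all hyperparameters are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class CRTSketch(nn.Module):
    """Hedged sketch of the CRT segment-processing loop described above."""

    def __init__(self, vocab_size, d_model=256, n_layers=3, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)  # shallow Transformer
        self.memory_rnn = nn.GRU(d_model, d_model, batch_first=True)          # updates the single memory vector
        self.pos_rnn = nn.GRU(d_model, d_model, batch_first=True)             # recurrent positional encoding
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, segment_ids, memory):
        # segment_ids: (batch, seg_len); memory: (batch, 1, d_model) from the previous segment.
        x = self.embed(segment_ids)
        x = torch.cat([memory, x], dim=1)              # (1) prepend the persistent memory token
        pos, _ = self.pos_rnn(torch.zeros_like(x))     # (3) positional signal from a zero-input RNN (a simplification)
        t = x.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.transformer(x + pos, mask=causal)     # local attention over memory + segment tokens
        _, new_mem = self.memory_rnn(h)                # (2) compress outputs into one memory vector
        logits = self.lm_head(h[:, 1:])                # predictions for the local tokens only
        return logits, new_mem.transpose(0, 1)         # carry the memory vector to the next segment

# Usage: iterate over segments, carrying one memory vector forward.
model = CRTSketch(vocab_size=10_000)
memory = torch.zeros(8, 1, 256)                       # initial memory for a batch of 8
for segment in torch.randint(0, 10_000, (5, 8, 70)):  # 5 consecutive segments of 70 tokens each
    logits, memory = model(segment, memory)
    memory = memory.detach()                          # truncate backprop across segments (a common choice)
```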
Experiment
The experiments evaluate CRT on language modeling with the Word PTB and WikiText-103 datasets and on video classification with the Toyota Smarthome dataset. For language modeling, CRT is compared against the Transformer and Transformer-XL using 3-layer and 16-layer models with varying segment lengths (17 to 150 tokens). CRT achieves lower perplexity (e.g., 58.3 vs. 67.0 for the Transformer on Word PTB with 3 layers and 70-token segments) while using significantly fewer FLOPs (about 5e9 fewer than Transformer-XL in some settings) and maintains performance with shorter segments. On WikiText-103, CRT remains competitive with Transformer-XL, often within 1-2 perplexity points, despite using a single memory vector. For video classification, the Compact Recurrent Vision Transformer (CR-ViT) outperforms state-of-the-art models such as PI-ViT on Toyota Smarthome, achieving a mean class accuracy of 73.4 without additional pose data. The experimental setup is reasonable for demonstrating efficiency on edge devices, but it lacks stress tests on extremely long sequences and more diverse video datasets that would fully validate long-term dependency handling. Ablation studies confirm the complementary benefits of the RNN memory and the recurrent positional encoding, though a deeper analysis of failure modes is missing. Overall, the results match the expectation of improved efficiency and performance, but the testing could be more comprehensive.
Further Thoughts
While CRT's approach to memory compression via a single vector is innovative, I wonder whether it could create an information bottleneck for extremely long sequences, a limitation inherent to RNN hidden states that the authors acknowledge but do not fully explore. Future work could investigate hybrid memory mechanisms, perhaps integrating sparse memory structures or attention-based memory updates to mitigate saturation. The application to edge computing is compelling, but real-world deployments often involve noisy or incomplete data, which might challenge the robustness of the persistent memory mechanism; testing under such conditions could be insightful. Relating this to broader AI trends, CRT's efficiency focus aligns with the growing interest in sustainable AI, where reducing computational footprints is critical. It would be interesting to explore CRT's integration with federated learning paradigms, where edge devices collaboratively train models under resource constraints, potentially enhancing privacy and scalability in distributed systems. Lastly, comparing CRT's memory approach with recent advances in state space models, which also target efficient sequence modeling, could clarify the trade-offs between explicit memory (as in CRT) and implicit state representations.