Tag: Transformer
All the articles with the tag "Transformer".
- Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
  This paper proposes Mixture of Sparse Attention (MoSA), which implements content-based sparse attention via expert-choice routing, significantly improving Transformer language-modeling performance under the same compute budget while making better use of resources. (A minimal sketch of expert-choice routing appears after this list.)
- Compact Recurrent Transformer with Persistent Memory
  This paper introduces the Compact Recurrent Transformer (CRT), which combines shallow Transformers with RNNs to process long sequences efficiently using a single persistent memory vector. CRT matches or surpasses full-length Transformers and Transformer-XL on language and video tasks at a significantly reduced computational cost.
- Adaptive Layer-skipping in Pre-trained LLMs
  This paper proposes FlexiDepth, which enables adaptive layer skipping in pre-trained LLMs via plug-in routers and adapters, improving computational efficiency while preserving generation performance; its experiments also reveal how token type affects compute demand. (A minimal sketch of a per-layer skip router appears after this list.)
- SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
  This work proposes SpargeAttn, a universal sparse attention mechanism that accelerates inference across a wide range of models through a two-stage online filter and quantization, while keeping end-to-end performance lossless.
- On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration
  This paper presents a software-hardware co-optimization framework that efficiently deploys the Qwen2.5-0.5B model on edge devices using AWQ model compression and FPGA acceleration, achieving a 55.1% compression ratio and 5.1 tokens/s inference speed while maintaining high accuracy.
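The sketch below illustrates the expert-choice-routing idea behind the MoSA entry above: each attention head acts as an "expert" that selects its own top-k tokens via a learned router and attends only over them. This is a minimal assumption-laden illustration, not the authors' implementation; all class and variable names (`ExpertChoiceSparseAttention`, `router`, `k`) are hypothetical, and for simplicity queries stay dense while only keys/values are sparsified.

```python
# Minimal sketch (not the MoSA implementation): each head picks its own
# top-k tokens with a learned router and attends only over that subset.
import torch
import torch.nn as nn


class ExpertChoiceSparseAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, k: int):
        super().__init__()
        self.n_heads, self.k = n_heads, k
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.router = nn.Linear(dim, n_heads)  # per-head token scores
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k_, v = self.qkv(x).chunk(3, dim=-1)

        # Expert choice: each head selects its own top-k tokens by router score.
        scores = self.router(x)                    # (B, T, H)
        topk = scores.topk(self.k, dim=1).indices  # (B, k, H)

        def gather_heads(t: torch.Tensor) -> torch.Tensor:
            # Split into heads and gather each head's chosen tokens.
            t = t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
            idx = topk.transpose(1, 2).unsqueeze(-1).expand(-1, -1, -1, self.head_dim)
            return t.gather(2, idx)                # (B, H, k, head_dim)

        qh = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)  # dense queries
        kh, vh = gather_heads(k_), gather_heads(v)                      # sparse keys/values

        attn = (qh @ kh.transpose(-2, -1)) / self.head_dim ** 0.5       # (B, H, T, k)
        ctx = attn.softmax(dim=-1) @ vh                                 # (B, H, T, head_dim)
        return self.out(ctx.transpose(1, 2).reshape(B, T, D))


# Usage: per-head attention cost drops from O(T^2) to O(T * k).
x = torch.randn(2, 128, 256)
print(ExpertChoiceSparseAttention(dim=256, n_heads=8, k=32)(x).shape)  # (2, 128, 256)
```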
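And this second sketch illustrates the plug-in-router idea behind the FlexiDepth entry above: a small gate attached to a pre-trained layer decides, per token, whether to run the full layer or take a cheap adapter path. Again a hedged sketch under assumptions, not the paper's method; the names (`SkippableLayer`, `router`, `adapter`) are hypothetical, and for readability both paths are computed here, whereas a real deployment would route only the selected tokens through the heavy layer to save compute.

```python
# Minimal sketch (not the FlexiDepth implementation): a per-token gate chooses
# between the frozen pre-trained layer and a lightweight residual adapter.
import torch
import torch.nn as nn


class SkippableLayer(nn.Module):
    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.layer = layer                    # frozen pre-trained Transformer layer
        self.router = nn.Linear(dim, 1)       # plug-in gate: compute or skip
        self.adapter = nn.Linear(dim, dim)    # cheap path for skipped tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.router(x))  # (B, T, 1) probability of computing
        keep = gate > 0.5                     # hard per-token decision at inference
        heavy = self.layer(x)                 # full layer output
        light = x + self.adapter(x)           # residual adapter output
        # Sketch only: both branches are evaluated; a real implementation would
        # gather the kept tokens and run the heavy layer on just that subset.
        return torch.where(keep, heavy, light)


# Usage: wrap an existing layer; illustrated with a standard encoder layer.
base = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
block = SkippableLayer(base, dim=256)
print(block(torch.randn(2, 16, 256)).shape)  # (2, 16, 256)
```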