Tag: Efficiency
All the articles with the tag "Efficiency".
-
Exploring the Role of Diversity in Example Selection for In-Context Learning
This paper proposes Diversity-based In-Context Learning (DICL), which re-ranks demonstration examples with the Maximal Marginal Relevance (MMR) algorithm to balance relevance and diversity, improving or maintaining downstream task performance in roughly 70% of settings across multiple datasets and large language models.
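A minimal sketch of MMR-style re-ranking over example embeddings, assuming cosine similarity; the function name `mmr_rerank` and the trade-off parameter `lam` are illustrative, not the paper's actual interface.

```python
import numpy as np

def mmr_rerank(query_emb, example_embs, k, lam=0.5):
    """Greedy Maximal Marginal Relevance: repeatedly pick the example that is
    most relevant to the query while least redundant with examples already
    selected. `lam` trades relevance against diversity (illustrative value)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = list(range(len(example_embs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos(query_emb, example_embs[i])
            redundancy = max((cos(example_embs[i], example_embs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of chosen examples, in MMR order
```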
-
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
This paper presents MegaScale-Infer, a system that optimizes large-scale MoE inference by disaggregating the parallelism strategies of the attention and FFN (expert) modules and connecting them with an efficient M2N communication library, achieving up to a 1.90x throughput improvement.
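A toy, single-process illustration of the disaggregation idea only, assuming simple top-1 gating; the worker callables and gating matrix are stand-ins, and the real system's M2N library and parallel execution are not modeled here.

```python
import numpy as np

def disaggregated_moe_step(tokens, attn_workers, expert_workers, gate_w):
    """Conceptual flow: attention runs on one pool of workers, expert FFNs on
    another, with an M-to-N token dispatch between the two pools."""
    # 1. Attention pool: each attention worker handles its shard of tokens.
    shards = np.array_split(tokens, len(attn_workers))
    attn_out = np.concatenate([w(shard) for w, shard in zip(attn_workers, shards)])

    # 2. M2N dispatch: route each token to one expert worker via top-1 gating,
    #    then gather expert outputs back into the original token order.
    expert_ids = np.argmax(attn_out @ gate_w, axis=-1)
    outputs = np.empty_like(attn_out)
    for e, expert in enumerate(expert_workers):
        mask = expert_ids == e
        if mask.any():
            outputs[mask] = expert(attn_out[mask])
    return outputs
```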
-
TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs
This paper presents TeLLMe, an energy-efficient ternary LLM accelerator for edge FPGAs that supports both the prefill and decode stages through a table-lookup matrix engine and reversed-attention optimizations, achieving up to 9.51 tokens/s throughput and low prefill latency within a 7 W power budget.
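A software-only sketch of the table-lookup idea behind ternary matrix engines, assuming weights restricted to {-1, 0, +1}; the group size and function name are illustrative and this is not a model of the actual FPGA design.

```python
import numpy as np
from itertools import product

def ternary_lut_matvec(W, x, group=2):
    """Toy matrix-vector product with ternary weights: for each group of `group`
    activations, precompute the partial sum for every possible ternary weight
    pattern once, so each output row performs a table lookup instead of
    multiplications."""
    n_out, n_in = W.shape
    assert n_in % group == 0
    patterns = list(product((-1, 0, 1), repeat=group))      # 3**group patterns
    pattern_index = {p: i for i, p in enumerate(patterns)}

    y = np.zeros(n_out, dtype=x.dtype)
    for g0 in range(0, n_in, group):
        xg = x[g0:g0 + group]
        lut = np.array([np.dot(p, xg) for p in patterns])   # build LUT for this group
        for row in range(n_out):
            p = tuple(int(v) for v in W[row, g0:g0 + group])
            y[row] += lut[pattern_index[p]]                  # lookup replaces multiply-add
    return y
```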
-
Splitwiser: Efficient LM inference with constrained resources
Splitwiser introduces a method to split LLM inference phases on a single GPU using multiprocessing and NVIDIA MPS, achieving modest latency reductions (up to 18.2%) and throughput improvements (up to 1.42x) on Huggingface and vLLM pipelines, though constrained by overheads and scalability issues.
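A conceptual sketch of phase splitting using only Python multiprocessing queues; the `prefill_worker`/`decode_worker` functions and the string KV-cache handle are placeholders, and the actual GPU sharing via NVIDIA MPS is configured outside the code and not shown.

```python
import multiprocessing as mp

def prefill_worker(request_q, kv_q):
    """Placeholder prefill phase: consume prompts, emit a (request id, KV cache)
    handle for the decode process."""
    while (req := request_q.get()) is not None:
        req_id, prompt = req
        kv_cache = f"kv[{prompt[:16]}...]"   # stand-in for the real prefill output
        kv_q.put((req_id, kv_cache))
    kv_q.put(None)                           # propagate shutdown to the decoder

def decode_worker(kv_q, out_q):
    """Placeholder decode phase: turn each KV cache handle into generated text."""
    while (item := kv_q.get()) is not None:
        req_id, kv_cache = item
        out_q.put((req_id, f"tokens decoded from {kv_cache}"))

if __name__ == "__main__":
    request_q, kv_q, out_q = mp.Queue(), mp.Queue(), mp.Queue()
    procs = [mp.Process(target=prefill_worker, args=(request_q, kv_q)),
             mp.Process(target=decode_worker, args=(kv_q, out_q))]
    for p in procs:
        p.start()
    request_q.put((0, "Explain speculative decoding in one sentence."))
    request_q.put(None)                      # signal end of requests
    print(out_q.get())
    for p in procs:
        p.join()
```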
-
Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
This paper proposes Mixture of Sparse Attention (MoSA), which uses expert-choice routing to realize content-based sparse attention, significantly improving Transformer language-modeling performance at the same compute budget while reducing resource usage.
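A minimal single-head sketch of expert-choice token selection, assuming a linear router and omitting causal masking and multi-head batching; all parameter names are illustrative rather than the paper's API.

```python
import torch
import torch.nn.functional as F

def expert_choice_sparse_attention(x, wq, wk, wv, router, k):
    """Toy content-based sparse attention: the head's router scores every token,
    the head "chooses" its top-k tokens, attends only among them, and scatters
    the results back to the chosen positions."""
    scores = x @ router                        # (seq,) router logit per token
    topk = torch.topk(scores, k).indices       # expert-choice: head picks its k tokens
    xs = x[topk]                               # (k, d) selected tokens

    q, key, v = xs @ wq, xs @ wk, xs @ wv      # project only the selected tokens
    attn = F.softmax(q @ key.T / key.shape[-1] ** 0.5, dim=-1)
    out = torch.zeros_like(x)
    out[topk] = attn @ v                       # non-selected positions stay zero
    return out
```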