Tag: Efficiency
All the articles with the tag "Efficiency".
-
CCSK: Cognitive Convection of Self-Knowledge Based Retrieval Augmentation for Large Language Models
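This paper proposes the CCSK framework, which uses a Siamese Network and a Response Quality Model to dynamically fuse query similarity and response quality, optimizing retrieval decisions for large language models and significantly improving F1 score and accuracy on multiple question-answering datasets.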
-
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
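This paper proposes MegaScale-Infer, a system that improves inference efficiency for large-scale MoE models by applying separate parallelism strategies to the attention and FFN modules and by using an efficient M2N communication library, achieving up to 1.90x higher throughput.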
-
LLM-Independent Adaptive RAG: Let the Question Speak for Itself
This paper introduces LLM-independent adaptive retrieval using 27 external information features across 7 groups, achieving comparable QA performance to LLM-based methods on 6 datasets while significantly improving efficiency by eliminating additional LLM calls during inference.
-
Towards Robust and Parameter-Efficient Knowledge Unlearning for LLMs
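This paper proposes the Low-rank Knowledge Unlearning (LoKU) framework, which combines an Inverted Hinge Loss (IHL) with Fisher-weighted low-rank adapter initialization (FILA) to achieve robust and parameter-efficient knowledge unlearning for large language models, effectively removing sensitive information while preserving the model's original capabilities.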
-
TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs
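This paper proposes TeLLMe, an energy-efficient ternary LLM accelerator for FPGAs that uses a table-lookup matrix engine and a reversed attention optimization to support both the prefill and decoding stages, achieving up to 9.51 tokens/s throughput and low prefill latency at 7 W of power.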