Tag: Efficiency

All the articles with the tag "Efficiency".

CCSK:Cognitive Convection of Self-Knowledge Based Retrieval Augmentation for Large Language Models

Published: 7 May, 2025 at 08:43 AM

70.69 🤔

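This paper proposes the CCSK framework, which uses a Siamese Network and a Response Quality Model to dynamically combine query similarity and response quality, optimizing the retrieval decisions of large language models and significantly improving F1 score and accuracy on multiple question-answering datasets.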
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

Published: 4 May, 2025 at 04:30 PM

70.65 🤔

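This paper presents MegaScale-Infer, a system that improves inference efficiency for large-scale MoE models through a parallelism strategy that disaggregates attention modules from FFN modules, together with an efficient M2N communication library, achieving up to 1.90x higher throughput.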
LLM-Independent Adaptive RAG: Let the Question Speak for Itself

Published: 13 May, 2025 at 11:09 AM

70.54 🤔

This paper introduces LLM-independent adaptive retrieval using 27 external information features across 7 groups, achieving comparable QA performance to LLM-based methods on 6 datasets while significantly improving efficiency by eliminating additional LLM calls during inference.
Towards Robust and Parameter-Efficient Knowledge Unlearning for LLMs

Published: 6 May, 2025 at 01:27 AM

70.15 🤔

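This paper proposes the Low-rank Knowledge Unlearning (LoKU) framework, which comprises an Inverted Hinge Loss (IHL) and Fisher-weighted low-rank adapter initialization (FILA), to achieve robust and parameter-efficient knowledge unlearning for large language models, effectively removing sensitive information while preserving the model's original capabilities.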
TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs

Published: 4 May, 2025 at 04:29 PM

70.02 🤔

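This paper presents TeLLMe, an energy-efficient ternary LLM accelerator for FPGAs that uses a table-lookup matrix engine and a reversed attention optimization to support both the prefilling and decoding stages, achieving up to 9.51 tokens/s throughput and low prefill latency at 7 W of power.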
