Tag: Efficiency
All the articles with the tag "Efficiency".
-
Does Self-Attention Need Separate Weights in Transformers?
This paper introduces a shared-weight self-attention mechanism for transformers that uses a single weight matrix with diagonal scaling to reduce attention-block parameters by 66.53%, achieving competitive performance on GLUE and improved noise robustness while slightly underperforming standard BERT on SQuAD.
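A minimal PyTorch sketch of the shared-weight idea described above: a single projection matrix serves Q, K, and V, and learned diagonal scaling vectors tell the three roles apart. The module structure, head splitting, and names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedWeightSelfAttention(nn.Module):
    """One shared projection for Q, K, V, differentiated by diagonal scaling."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.shared_proj = nn.Linear(d_model, d_model, bias=False)  # single weight matrix
        # Diagonal scaling: one learnable vector per role instead of three full matrices
        self.scale_q = nn.Parameter(torch.ones(d_model))
        self.scale_k = nn.Parameter(torch.ones(d_model))
        self.scale_v = nn.Parameter(torch.ones(d_model))
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, s, d = x.shape
        shared = self.shared_proj(x)            # one matmul replaces three projections
        q = shared * self.scale_q
        k = shared * self.scale_k
        v = shared * self.scale_v
        # Split into heads and run standard scaled dot-product attention
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return self.out_proj(out.transpose(1, 2).reshape(b, s, d))
```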
-
Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation
This paper proposes DPE, a training-free long-context extrapolation method that detects the effective relative distance of each group of RoPE dimensions, identifies the key dimensions, and selectively adjusts the position indices of those dimensions, substantially extending the LLM's context window and improving performance on long-context tasks.
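The dimension-wise manipulation can be pictured as giving each group of RoPE frequency pairs its own, possibly rescaled, position indices. The sketch below assumes a `group_scales` mapping from pair-index ranges to scale factors; how DPE actually detects effective distances and selects key dimensions is not reproduced here.

```python
import torch

def rope_angles_dimensionwise(seq_len: int, head_dim: int,
                              group_scales: dict, base: float = 10000.0):
    """Return cos/sin RoPE tables where selected frequency-pair groups
    use rescaled position indices instead of the shared default."""
    n_pairs = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(n_pairs, dtype=torch.float32) * 2 / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)            # (seq,)
    pos_per_pair = positions.unsqueeze(1).repeat(1, n_pairs)          # (seq, pairs), default ids
    for (start, end), scale in group_scales.items():
        # Compress indices on "key" dimension groups so relative distances
        # stay within the range those dimensions handled during pretraining
        pos_per_pair[:, start:end] = positions.unsqueeze(1) * scale
    angles = pos_per_pair * inv_freq                                   # (seq, pairs)
    return torch.cos(angles), torch.sin(angles)

# Example: rescale only the lowest-frequency quarter of a 128-dim head
cos, sin = rope_angles_dimensionwise(8192, 128, {(48, 64): 0.25})
```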
-
LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment
LSAQ introduces a Layer-Specific Adaptive Quantization system for LLMs that uses Jaccard similarity to assess layer importance and dynamically adjusts quantization precision to the resources of the target edge device, achieving higher zero-shot accuracy and lower perplexity than baseline methods while enabling efficient deployment.
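A small Python sketch of the layer-importance idea: Jaccard similarity between top-token sets serves as a cheap importance signal, and only the highest-ranked layers keep the higher bit-width within a device-dependent budget. The `1 - similarity` importance rule, the budget value, and the token-set construction are illustrative assumptions rather than LSAQ's exact procedure.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def plan_layer_bits(top_tokens_in, top_tokens_out,
                    high_bits=8, low_bits=4, high_budget=8):
    """Rank layers by 1 - Jaccard(top tokens in, top tokens out): a layer whose
    output token set differs strongly from its input set is treated as more
    important, and only the most important layers keep the higher bit-width."""
    importance = [(i, 1.0 - jaccard(s_in, s_out))
                  for i, (s_in, s_out) in enumerate(zip(top_tokens_in, top_tokens_out))]
    importance.sort(key=lambda item: item[1], reverse=True)
    plan = {i: low_bits for i, _ in importance}
    for i, _ in importance[:high_budget]:       # budget set by available device memory
        plan[i] = high_bits
    return plan
```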
-
Accelerating Large Language Model Reasoning via Speculative Search
Speculative Search (SpecSearch) accelerates LLM reasoning by up to 2.12× with a bi-level speculative thought generator in which a small model and a large model collaborate, preserving comparable reasoning quality through a quality-preserving rejection mechanism.
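One way to picture the collaboration is the step below: a small model drafts candidate thoughts, an evaluator scores them, and the large model only steps in when no draft clears the quality threshold. The callables and the threshold are stand-ins for illustration, not SpecSearch's actual interfaces.

```python
def speculative_thought_step(small_model, large_model, evaluate, state,
                             n_drafts=4, threshold=0.7):
    """Draft thoughts with the small model; accept the best one only if it
    clears the quality threshold, otherwise let the large model generate."""
    drafts = [small_model(state) for _ in range(n_drafts)]      # cheap proposals
    scored = [(evaluate(state, t), t) for t in drafts]
    best_score, best_thought = max(scored, key=lambda st: st[0])
    if best_score >= threshold:
        return best_thought                     # quality-preserving acceptance
    return large_model(state)                   # rejection: fall back to the large model

# Toy usage with stand-in callables
next_thought = speculative_thought_step(
    small_model=lambda s: s + " -> quick step",
    large_model=lambda s: s + " -> careful step",
    evaluate=lambda s, t: 0.8 if "quick" in t else 0.5,
    state="premise",
)
```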
-
Efficient Reasoning for LLMs through Speculative Chain-of-Thought
This paper proposes the Speculative Chain-of-Thought (SCoT) framework, in which a lightweight draft model generates multiple chain-of-thought drafts in parallel and a fine-tuned target model selects the best draft or decides to re-think, substantially reducing inference latency while keeping accuracy close to that of the large model.
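A minimal sketch of the draft-then-select flow, under assumed `draft_model` / `target_model` interfaces that are not the paper's implementation: several chain-of-thought drafts are produced cheaply, and the target model either picks one or reasons from scratch.

```python
def speculative_cot(draft_model, target_model, question, n_drafts=4):
    """Generate several chain-of-thought drafts cheaply, then let the target
    model either pick one or decide to reason from scratch."""
    drafts = [draft_model.generate_cot(question) for _ in range(n_drafts)]  # parallelizable
    choice = target_model.select(question, drafts)    # index of best draft, or None to re-think
    if choice is None:
        return target_model.generate_cot(question)    # target model re-thinks itself
    return drafts[choice]
```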