Tag: Pre-training
All the articles with the tag "Pre-training".
-
Task Specific Pruning with LLM-Sieve: How Many Parameters Does Your Task Really Need?
LLM-Sieve proposes a task-specific pruning framework that combines joint low-rank projections with a genetic algorithm to prune each matrix differentially, removing 20-75% of parameters at only 1-5% accuracy loss, substantially outperforming existing methods while remaining compatible with LoRA fine-tuning and quantization.
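A minimal sketch of the low-rank side of such a pruning scheme, assuming a PyTorch `nn.Linear` layer and a truncated-SVD factorization; the per-layer rank (which LLM-Sieve searches for with a genetic algorithm) is supplied by hand here, and `prune_linear_low_rank` is an illustrative name, not the paper's API.

```python
# Replace a Linear layer's weight with a rank-r factorization. The rank per
# layer would normally come from a task-aware genetic search; here it is fixed.
import torch
import torch.nn as nn

def prune_linear_low_rank(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Swap a Linear layer for two smaller Linear layers via truncated SVD."""
    W = layer.weight.data                        # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                 # (out_features, rank)
    V_r = Vh[:rank, :]                           # (rank, in_features)

    down = nn.Linear(layer.in_features, rank, bias=False)
    up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    down.weight.data = V_r
    up.weight.data = U_r
    if layer.bias is not None:
        up.bias.data = layer.bias.data
    return nn.Sequential(down, up)

# Differentiated ranks per layer are what make the pruning "task specific":
# a rank assignment is scored on task accuracy vs. parameter count.
layer = nn.Linear(1024, 1024)
pruned = prune_linear_low_rank(layer, rank=128)
print(sum(p.numel() for p in pruned.parameters()), "parameters after pruning")
```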
-
Zebra-Llama: Towards Extremely Efficient Hybrid Models
Zebra-Llama builds efficient hybrid models from pretrained Transformers by combining state-space model layers with multi-head latent attention (MLA) layers, sharply reducing KV-cache size and improving inference throughput while matching or exceeding baseline performance.
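A back-of-the-envelope sketch of why replacing most attention layers shrinks the generation-time cache; the layer counts, head sizes, and latent width below are illustrative assumptions, not Zebra-Llama's actual configuration.

```python
# Compare the per-sequence cache of a dense Transformer (full K/V every layer)
# with a hybrid stack where only a few MLA layers cache a compressed latent
# and SSM layers keep a constant-size recurrent state instead.

def dense_kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # K and V tensors per layer: 2 * batch * seq * heads * head_dim
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * dtype_bytes

# Dense baseline: 32 attention layers, 8 KV heads of width 128.
dense = dense_kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                             seq_len=8192, batch=1)

# Hybrid: assume 8 of 32 layers stay attention-like as MLA, each caching one
# 512-dim latent per token; the 24 SSM layers add no per-token cache.
n_mla_layers, latent_dim, seq_len, dtype_bytes = 8, 512, 8192, 2
hybrid = n_mla_layers * 1 * seq_len * latent_dim * dtype_bytes

print(f"dense KV cache : {dense / 2**20:.0f} MiB")
print(f"hybrid cache   : {hybrid / 2**20:.0f} MiB")
```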
-
The Mosaic Memory of Large Language Models
This paper introduces the concept of 'mosaic memory' in Large Language Models: experiments with canaries and real-world datasets such as SlimPajama show that LLMs memorize training data through fuzzy duplicates with partial, predominantly syntactic overlaps. This challenges existing deduplication practices and raises concerns about privacy, model utility, and benchmark fairness.
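A toy sketch of what a fuzzy duplicate might look like and how partial overlap could be quantified, assuming token sequences and a Jaccard n-gram score; the canary construction here is illustrative, not the paper's exact protocol.

```python
# Build a near-copy of a canary by replacing a fraction of its tokens, then
# score the remaining syntactic overlap with an n-gram Jaccard similarity.
import random

def make_fuzzy_duplicate(tokens, replace_frac, vocab, rng):
    """Copy a token sequence, replacing a random fraction of positions."""
    out = list(tokens)
    k = int(len(tokens) * replace_frac)
    for i in rng.sample(range(len(tokens)), k):
        out[i] = rng.choice(vocab)
    return out

def ngram_overlap(a, b, n=3):
    """Jaccard overlap of n-gram sets, a rough syntactic-similarity measure."""
    grams = lambda s: {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

rng = random.Random(0)
vocab = [f"tok{i}" for i in range(1000)]
canary = rng.choices(vocab, k=64)
dup = make_fuzzy_duplicate(canary, replace_frac=0.2, vocab=vocab, rng=rng)
print(f"3-gram overlap with 20% of tokens replaced: {ngram_overlap(canary, dup):.2f}")
```

Exact-match deduplication would treat `canary` and `dup` as distinct documents, which is why such partial-overlap copies can slip through and still drive memorization.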
-
Do Language Models Use Their Depth Efficiently?
Through residual-stream analysis and intervention experiments on Llama 3.1 and Qwen 3 models, this paper finds that large language models do not use their depth effectively: the second half of the layers mainly refines the output probability distribution rather than performing new computation, and processing depth does not scale with input complexity, suggesting that current architectures and training objectives need improvement.
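A logit-lens-style probe of this effect, as a minimal sketch: decode each layer's residual stream with the model's final norm and unembedding, and measure how close it already is to the final prediction. GPT-2 stands in for Llama 3.1/Qwen 3 purely for convenience, and this is not the paper's exact methodology.

```python
# If later layers mostly refine the distribution, the KL to the final output
# should already be small well before the last layer.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_logprobs = F.log_softmax(out.logits[0, -1], dim=-1)

for layer, h in enumerate(out.hidden_states):
    # Apply the final LayerNorm and the unembedding to the last token's state.
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    kl = F.kl_div(F.log_softmax(logits, dim=-1), final_logprobs,
                  log_target=True, reduction="sum")
    print(f"layer {layer:2d}: KL to final distribution = {kl.item():.3f}")
```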
-
Memorization-Compression Cycles Improve Generalization
This paper proposes the Information Bottleneck Language Modeling (IBLM) objective and the Gated Phase Transition (GAPT) algorithm, showing theoretically and empirically that dynamically switching between memorization and compression phases to lower representation entropy significantly improves the generalization of large language models and their ability to resolve conflicting memories.
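A toy sketch of a memorization/compression cycle under strong simplifying assumptions: a small classifier instead of a language model, a log-determinant proxy for representation entropy, and a fixed phase schedule in place of GAPT's plateau-based gating.

```python
# Alternate between a phase that minimizes only the prediction loss
# ("memorize") and one that also penalizes a representation-entropy proxy
# ("compress"). Everything here is illustrative, not the paper's algorithm.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def representation_entropy(h):
    """Proxy: log-determinant of the (regularized) feature covariance."""
    hc = h - h.mean(0, keepdim=True)
    cov = hc.T @ hc / h.shape[0] + 1e-4 * torch.eye(h.shape[1])
    return torch.logdet(cov)

phase = "memorize"
for step in range(200):
    x = torch.randn(128, 32)
    y = torch.randint(0, 10, (128,))
    h = model[1](model[0](x))            # hidden representation
    loss = ce(model[2](h), y)
    if phase == "compress":
        loss = loss + 0.01 * representation_entropy(h)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Gate: a fixed schedule here; GAPT instead switches when the prediction
    # loss or the representation entropy plateaus.
    if step % 50 == 49:
        phase = "compress" if phase == "memorize" else "memorize"
```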