Posts
All the articles I've posted.
-
Communication-Efficient Wireless Federated Fine-Tuning for Large-Scale AI Models
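This paper proposes a wireless federated LoRA fine-tuning framework that uses Sparsified Orthogonal Fine-Tuning (SOFT) and a Two Stage Federated Algorithm (TSFA) to optimize parameter sparsification and dynamic resource allocation, improving communication efficiency and learning performance.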
-
Cache-Efficient Posterior Sampling for Reinforcement Learning with LLM-Derived Priors Across Discrete and Continuous Domains
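This paper proposes a cache-efficient posterior sampling framework that reuses LLM priors through a cache mechanism optimized by meta-learning, significantly reducing the computational cost of reinforcement learning (3.8-4.7x fewer queries, 4.0-12.0x lower latency) while retaining 96-98% of performance on text and continuous control tasks.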
-
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
RetroInfer reimagines the KV cache as a vector storage system, using an attention-aware wave index and wave buffer to achieve up to 4.5x speedup over full attention and 10.5x over sparse baselines for long-context LLM inference, while preserving near-full-attention accuracy.
-
Does Knowledge Distillation Matter for Large Language Model based Bundle Generation?
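This paper is the first systematic exploration of knowledge distillation for LLM-based bundle generation. It proposes a comprehensive KD framework and shows experimentally that performance can be maintained or even improved while computational requirements are reduced.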
-
Llama-Nemotron: Efficient Reasoning Models
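NVIDIA released the Llama-Nemotron series of open models, a heterogeneous model family built by combining neural architecture search, knowledge distillation, continued pretraining, multi-stage supervised fine-tuning on high-quality synthetic data, and large-scale reinforcement learning. The models reach leading levels of both reasoning ability and efficiency and support dynamic switching between reasoning modes.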