Tag: Instruction Tuning
All the articles with the tag "Instruction Tuning".
-
Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant
This paper comprehensively evaluates four quantization methods (GPTQ, AWQ, SmoothQuant, FP8) on instruction-tuned LLMs and SLMs ranging from 1B to 405B parameters across 13 datasets. It finds that quantized models often outperform smaller full-precision baselines but struggle with instruction following and hallucination detection; FP8 proves the most robust method, and task difficulty does not always correlate with accuracy loss.
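As a rough illustration of the kind of post-training quantization the paper benchmarks, the sketch below loads an instruction-tuned model with 4-bit GPTQ via Hugging Face Transformers. The model ID and calibration settings are illustrative placeholders, not the paper's evaluation setup.

```python
# Minimal sketch: 4-bit GPTQ post-training quantization with Hugging Face Transformers.
# Assumes `optimum` and `auto-gptq` are installed; the model ID and calibration
# dataset are placeholders, not the configuration evaluated in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibrate on a small text corpus ("c4") and quantize weights to 4 bits.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# The quantized model is then used like any other causal LM.
prompt = "List three risks of deploying quantized LLMs on edge devices."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```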
-
Reverse Preference Optimization for Complex Instruction Following
This paper proposes Reverse Preference Optimization (RPO), which removes noise from preference pairs by dynamically reversing the constraints in an instruction that a response fails to satisfy. RPO significantly outperforms DPO baselines on multi-turn complex instruction-following tasks and surpasses GPT-4o at the 70B scale.
-
Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective
This paper proposes the 'Trajectory Policy Gradient Theorem', which proves that in online reinforcement learning for LLMs, response-level rewards alone suffice to obtain an unbiased estimate of the policy gradient for token-level rewards. Building on this result, the authors design TRePO, an algorithm that simplifies PPO's design while retaining token-level modeling capability.
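For context, the standard REINFORCE-style policy gradient already uses only a response-level (trajectory-level) reward; a minimal statement of that identity, in notation chosen here rather than taken from the paper, is:

```latex
% Trajectory-level policy gradient with a response-level reward R(x, y):
% x is the prompt, y = (y_1, ..., y_T) the sampled response, \pi_\theta the policy.
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \Big[ R(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big]
```

The paper's contribution, as summarized above, is to relate this response-level formulation to the gradient of token-level rewards; the identity here is only the standard starting point, not the theorem itself.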
-
Unveiling the Mechanisms of Explicit CoT Training: How CoT Enhances Reasoning Generalization
Through controlled experiments, analysis of internal mechanisms, and theoretical derivation, this paper shows that explicit chain-of-thought (CoT) training forms a two-stage generalization circuit that substantially improves both in-distribution (ID) and out-of-distribution (OOD) reasoning generalization in large language models, and verifies its robustness under noisy training data.
-
What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction
Through theoretical analysis, this paper distinguishes three interpretations of a language model's output probabilities (the completion distribution, the response distribution, and the event distribution), reveals how existing work conflates and misinterprets these distributions, and calls for careful interpretation of model probabilities to guide LLM development and application.