arXiv: 2409.11055

Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant


This paper comprehensively evaluates four quantization methods (GPTQ, AWQ, SmoothQuant, FP8) on instruction-tuned LLMs and SLMs ranging from 1B to 405B parameters across 13 datasets. It finds that quantized models often outperform smaller full-precision baselines but struggle with instruction following and hallucination detection, that FP8 is the most robust format, and that task difficulty does not always correlate with accuracy loss.

Large Language Model, Efficiency, Instruction Tuning, Robustness, Multimodal Data

Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon

Electronics and Telecommunications Research Institute, Korea Electronics Technology Institute, Neubla

Generated by grok-3

Background Problem

Deploying large language models (LLMs) and small language models (SLMs) in resource-constrained environments, such as mobile-edge and server scenarios, is challenging due to their high memory and computational demands. Quantization, particularly Post-Training Quantization (PTQ), has emerged as a solution to reduce these overheads. However, prior research has focused on limited metrics like perplexity and outdated benchmarks, neglecting recent model architectures (e.g., Llama-3.3) and comprehensive task evaluations across diverse scales (1B to 405B parameters). This paper addresses these gaps by evaluating the impact of quantization on instruction-tuned models across a wide range of tasks and model sizes, focusing on performance trade-offs and task-specific challenges.

Method

The paper investigates the impact of four Post-Training Quantization (PTQ) methods—GPTQ, AWQ, SmoothQuant, and FP8—on instruction-tuned LLMs and SLMs ranging from 1B to 405B parameters. GPTQ and AWQ are weight-only methods: GPTQ uses approximate inverse-Hessian information to compensate for quantization error, while AWQ protects salient weights via activation-aware per-channel scaling. SmoothQuant smooths activation outliers before quantization to improve robustness, while FP8 applies 8-bit floating-point quantization to both weights and activations for balanced performance. All methods are applied without retraining, using calibration datasets with default settings for sample size and sequence length. Evaluations are conducted on a multi-node GPU cluster using vLLM and Huggingface Accelerate, ensuring consistent and reproducible results across 13 benchmark datasets grouped into six task categories.
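To make the weight-only PTQ setup concrete, here is a minimal sketch of GPTQ-style calibration-based quantization via Hugging Face transformers' GPTQConfig. It is an illustrative approximation, not the authors' exact pipeline; the model id, bit width, and calibration dataset below are assumptions.

```python
# Minimal sketch of weight-only PTQ in the spirit of the paper's GPTQ setup.
# Assumptions: example model id, 4-bit setting, and "c4" calibration data.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQConfig runs calibration-based weight quantization at load time;
# no retraining is required (post-training quantization).
gptq_config = GPTQConfig(
    bits=4,          # 4-bit weight-only quantization
    dataset="c4",    # calibration dataset (default-style setting)
    tokenizer=tokenizer,
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
quantized_model.save_pretrained("llama-3.2-3b-instruct-gptq-4bit")
```

The same load-then-quantize pattern applies to the other PTQ methods, with the method-specific configuration (e.g., scaling factors for AWQ or SmoothQuant) swapped in.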

Experiment

The experiments evaluate 12 instruction-tuned models (Vicuna, Gemma, and Llama families) ranging from 1B to 405B parameters, quantized with GPTQ, AWQ, SmoothQuant, and FP8, across 13 datasets grouped into six categories: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue (via MT-Bench). The setup uses a multi-node GPU cluster with four server configurations (H100, A100, RTX 6000, A6000), with evaluation settings synchronized with the Huggingface OpenLLM Leaderboard-v1 and v2 for reliable assessments.

Results show that quantized models generally outperform smaller FP16 baselines, with significant improvements for SLMs (e.g., quantized Llama-3.2-3B gains up to 13.32% over the 1B baseline). However, notable performance drops appear in instruction following (IFEval) and hallucination detection (TruthfulQA). FP8 proves the most robust across tasks and scales, while SmoothQuant struggles at 405B (up to a 10.86% accuracy drop), and AWQ consistently outperforms GPTQ among weight-only methods. Smaller models (1B-3B) suffer severe accuracy loss at 4 bits (e.g., -25.32% on GSM8K for Llama-3.2-1B), whereas 70B models remain stable. Task difficulty does not always predict accuracy loss; instead, quantization tends to amplify a model's inherent weaknesses. MT-Bench reveals significant declines in coding and STEM tasks, though reasoning scores sometimes improve.

The setup is comprehensive, but the reliance on automated judging with GPT-4 introduces errors, and some results (e.g., Vicuna, Gemma) are relegated to appendices due to space constraints, limiting the analysis. Overall, the results partially match expectations, but inconsistencies in the evaluation metrics suggest a need for more robust judging mechanisms.
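For the evaluation side, the following is a minimal sketch of how a quantized checkpoint might be served with vLLM, one of the tools used in the paper's setup. The model name, tensor-parallel degree, quantization mode, and prompt are placeholders rather than the authors' exact configuration.

```python
# Minimal sketch of running inference on a quantized model with vLLM.
# Assumptions: example 70B model, FP8 quantization, 4-way tensor parallelism,
# and a GSM8K-style prompt; these are not the paper's exact settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed example model
    quantization="fp8",                          # 8-bit weights and activations
    tensor_parallel_size=4,                      # split across 4 GPUs (assumed)
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(
    ["Natalia sold clips to 48 of her friends in April, and then half as many "
     "in May. How many clips did she sell altogether?"],
    sampling_params,
)
for out in outputs:
    print(out.outputs[0].text)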

Further Thoughts

The findings on quantization’s varying impact across tasks and model sizes open up intriguing avenues for future research, particularly in understanding why tasks like instruction following and hallucination detection are disproportionately affected. This could tie into broader discussions on model alignment and safety, as quantization might inadvertently exacerbate biases or errors in critical areas, a concern also raised in studies on model compression and robustness (e.g., work on RLHF and alignment). Additionally, the observed robustness of FP8 suggests potential synergies with hardware optimization research, such as NVIDIA’s advancements in mixed-precision training and inference, which could further reduce deployment costs in edge scenarios. Another question is the integration of quantization with other efficiency techniques such as pruning or knowledge distillation: could a hybrid approach mitigate the weaknesses amplified by quantization alone? Finally, the misjudgments by GPT-4 in the MT-Bench evaluations highlight a pressing need for more reliable evaluation frameworks, perhaps drawing on human-in-the-loop methodologies or cross-validation with diverse judge models, to ensure that quantization assessments reflect real-world performance accurately.


