This paper introduces ‘Radio,’ a rate-distortion optimization framework for post-training LLM compression that iteratively optimizes per-group bit depths and applies companding quantization, outperforming existing quantization methods in perplexity and downstream task accuracy, particularly at lower bit depths.
Large Language Model, Efficiency, Pre-training, Representation Learning
Sean I. Young
Martinos Center, Harvard Medical School, Boston, MA, USA; Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA, USA
Background Problem
The rapid growth of Large Language Models (LLMs) with tens to hundreds of billions of parameters poses significant challenges for deployment on resource-constrained devices due to high memory and computational requirements. This memory-bound inference process increases latency in time-sensitive applications and contributes to a larger environmental footprint from AI infrastructure. Existing quantization methods, while effective at reducing model size, often lack a theoretical foundation for balancing accuracy and compression rate, leading to suboptimal performance. This paper addresses the problem of LLM compression by introducing a rate-distortion theoretic framework to optimize quantization, aiming to maximize model accuracy at a user-specified bit rate or model size post-training.
Method
The proposed method, named ‘Radio,’ formulates LLM quantization as a rate-distortion optimization problem to minimize output distortion while adhering to a target bit rate. It operates post-training and involves:
- Rate-Distortion Framework: Quantization is modeled as a constrained least-squares problem where bit depths for weight groups are optimized to balance distortion (accuracy loss) and bit rate (compression), using a Lagrangian approach with dual ascent updates.
- Stochastic Ascent Algorithm: Bit depths are iteratively adjusted using gradient variances computed via backpropagation on calibration data, with PCA and subsampling to reduce computational cost. The algorithm clamps bit depths between 0 and 8 bits and updates a dual variable to meet the target bit rate (a toy sketch of this dual-ascent allocation appears after this list).
- Companding Quantization: A sigmoid transform is applied to weights before uniform quantization to better match non-uniform weight distributions (e.g., Laplace), reducing errors for the more probable, small-magnitude weights (see the companding sketch after this list).
- Weight Grouping: Weights are grouped into smaller units (e.g., rows or columns) for fine-grained bit depth assignment, enhancing compression efficiency.
- Bias Correction: Systematic biases introduced by quantization errors are mitigated by updating bias vectors based on running means of layer inputs.

Unlike methods such as GPTQ, which adjust the remaining weights, the method avoids fine-tuning during quantization, making it applicable to both weights and activations.
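To make the bit-depth assignment concrete, here is a minimal sketch of the dual-ascent mechanic described in the list above. It assumes a textbook high-rate distortion model (distortion proportional to sensitivity * 2^(-2b)); the per-group sensitivities stand in for the gradient-variance statistics that Radio estimates from calibration data, and the function names, step size, and update rule are illustrative rather than the paper's implementation.

```python
import numpy as np

def allocate_bit_depths(sensitivities, target_bits, max_bits=8, iters=64, step=0.1):
    """Toy dual-ascent bit allocation: choose per-group bit depths that trade
    quantization distortion against an average bit-rate budget."""
    lam = 1.0  # dual variable (Lagrange multiplier) for the rate constraint
    for _ in range(iters):
        # For distortion ~ s * 2**(-2b), minimizing s * 2**(-2b) + lam * b over b
        # gives b = 0.5 * log2(2 * ln(2) * s / lam).
        bits = 0.5 * np.log2(2 * np.log(2) * sensitivities / lam)
        bits = np.clip(np.round(bits), 0, max_bits)  # integer depths in [0, max_bits]
        # Dual ascent on the rate constraint: raise lam when over budget, lower it otherwise.
        lam *= np.exp(step * (bits.mean() - target_bits))
    return bits

# Example: eight weight groups with unequal sensitivities and a 4-bit average budget.
rng = np.random.default_rng(0)
sensitivities = rng.lognormal(mean=0.0, sigma=1.5, size=8)
print(allocate_bit_depths(sensitivities, target_bits=4.0))
```

More sensitive groups receive more bits, and the multiplier is nudged until the average bit depth meets the budget.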
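The companding step can be sketched in a similarly simplified form: squash weights through a sigmoid, quantize uniformly in the companded domain, then expand back. The sigmoid slope (alpha) and scaling below are assumptions chosen for illustration, not the paper's fitted compander.

```python
import numpy as np

def compand_quantize(w, bits, alpha=2.0):
    """Toy companding quantizer: sigmoid compand, uniform quantization, expand."""
    scale = np.max(np.abs(w)) + 1e-12
    u = 1.0 / (1.0 + np.exp(-alpha * w / scale))        # compand weights into (0, 1)
    levels = 2 ** bits - 1
    u_q = np.clip(np.round(u * levels) / levels, 1e-6, 1 - 1e-6)  # uniform grid, keep logit finite
    return (scale / alpha) * np.log(u_q / (1.0 - u_q))  # expand back to the weight domain

# Laplace-like weights are dense near zero, where the compander spends more quantization levels.
rng = np.random.default_rng(0)
w = rng.laplace(scale=0.02, size=1024)
w_hat = compand_quantize(w, bits=3)
print("RMS quantization error:", np.sqrt(np.mean((w - w_hat) ** 2)))
```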
Experiment
The experiments evaluate ‘Radio’ on the Meta OPT and Llama-2 model families, ranging from 125M to 70B parameters, quantized to 3-4 bits as well as fractional bit depths (2.1-2.8 bits). Calibration data (128 examples from C4) and test sets (WikiText2, C4) are used for next-token prediction (perplexity), alongside downstream tasks such as GSM8K and common-sense QA (ARC, HellaSwag, PIQA, WinoGrande). The setup uses batch sizes of 16, token counts of 17, and group sizes of 256-512, with optimization run for up to 64 iterations. Results show ‘Radio’ outperforming baselines (RTN, GPTQ, OWQ, AWQ, QuIP, OmniQuant, SqueezeLLM) in perplexity, especially for smaller models (e.g., a 4.55 PPL reduction for 3-bit OPT-125M) and at lower bit depths (e.g., 2.x-bit Llama-2). However, gains are marginal for larger models (e.g., 0.00-0.01 PPL for OPT-66B and Llama-2 70B). Downstream task performance is slightly better than GPTQ/AWQ, while RTN shows severe degradation despite similar perplexity, suggesting task-specific weaknesses that are not fully explored. The setup is comprehensive across model sizes and tasks, but the computational overhead (e.g., 47 minutes for Llama-2 7B versus 18 minutes for OWQ/GPTQ on an Nvidia A100) and the signaling overhead of fine-grained grouping (up to 10.33% for smaller models; a toy illustration of this accounting appears after this paragraph) are notable drawbacks. The results match expectations for smaller models and lower bit depths but raise questions about scalability and practical benefit for larger models.
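As a rough illustration of the signaling overhead incurred by fine-grained grouping, the toy calculation below charges each weight group a few bits for its bit depth plus an fp16 scale; the group sizes and metadata costs are assumptions for illustration and are not intended to reproduce the paper's reported 10.33% figure.

```python
def signaling_overhead(group_size, weight_bits, depth_bits=4, scale_bits=16):
    """Fraction of extra storage spent on per-group side information
    (an assumed 4-bit bit-depth code plus an fp16 scale per group)."""
    payload = group_size * weight_bits   # bits spent on the quantized weights themselves
    side_info = depth_bits + scale_bits  # bits spent describing the group
    return side_info / payload

for g in (256, 512):
    print(f"group size {g}: {100 * signaling_overhead(g, weight_bits=3):.2f}% overhead")
```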
Further Thoughts
The rate-distortion framework in ‘Radio’ offers a compelling theoretical lens for LLM compression, potentially inspiring similar optimization approaches in other AI domains like vision foundation models where memory constraints are also critical. However, its practical utility is tempered by computational overhead, which could be a bottleneck in real-time applications or on-device inference scenarios. An interesting connection arises with federated learning, where model compression is vital for communication efficiency; adapting ‘Radio’ to federated settings could be a fruitful direction, though it would require addressing convergence issues in distributed environments. Additionally, the companding technique’s reliance on specific weight distributions prompts exploration into adaptive or learned transformations that generalize across diverse model architectures and data distributions. The marginal gains for larger models also suggest a need to investigate whether emergent abilities in LLMs (as discussed in scaling laws literature) interact with quantization effects, potentially revealing deeper insights into model compressibility. Finally, the discrepancy in downstream task performance (e.g., RTN’s poor GSM8K results) underscores the importance of diverse evaluation metrics beyond perplexity, aligning with recent critiques in AI evaluation literature about over-reliance on single metrics.