This paper introduces a two-stage LLM compression method that applies RPCA for low-rank plus sparse decomposition followed by probabilistic pruning via policy gradient, outperforming state-of-the-art techniques at a 50% compression ratio while automatically adapting to layer-wise redundancy without manual thresholds or extensive fine-tuning.
Large Language Model, Efficiency, Pre-training, Transformer
Changhai Zhou, Qian Qiao, Weizhong Zhang, Cheng Jin
Fudan University, Soochow University
Generated by grok-3
Background Problem
The rapid advancement of Transformer-based Large Language Models (LLMs) has led to remarkable achievements in NLP, vision, and multimodal tasks. However, their massive parameter counts impose significant storage, memory, and computational costs, hindering real-world deployment under hardware constraints. Existing compression strategies such as quantization and pruning often suffer severe performance degradation at high compression levels, while low-rank plus sparse decompositions rely on manually set thresholds and lack global optimization across layers. This paper aims to address these challenges by proposing a compression method that automates rank and sparsity allocation across layers and manages the interaction between the low-rank and sparse components, thereby minimizing performance loss under stringent parameter budgets.
Method
The proposed method, termed CAP (Composite Approximation with Probabilistic pruning), operates in two stages to compress LLMs. In the first stage, Robust Principal Component Analysis (RPCA) decomposes each weight matrix into a low-rank component (L), capturing global patterns, and a sparse component (S), representing local outliers, using nuclear norm and L1 penalties to automatically determine rank and sparsity without heuristic thresholds. This decomposition is optimized via the Alternating Direction Method of Multipliers (ADMM), reducing the search space significantly. In the second stage, a probabilistic global optimization technique employs Bernoulli random variables to decide which singular values in L and nonzero entries in S to retain, with retention probabilities learned via policy gradient (REINFORCE) on a small calibration set to minimize loss under a parameter budget. The final compressed matrix is reconstructed by factorizing the retained low-rank component into smaller matrices and applying binary masks to the sparse component, ensuring efficiency in storage and inference.
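To make the two stages concrete, the sketch below shows one possible instantiation in PyTorch: an ADMM-style RPCA solver that splits a weight matrix into low-rank and sparse parts via singular-value and soft thresholding, followed by learnable Bernoulli retention probabilities whose REINFORCE gradient could be driven by a calibration loss. This is a minimal illustration under assumed defaults (the function names, penalty weights, and iteration count are not taken from the paper), not the authors' CAP implementation.

```python
import torch

def rpca_admm(W, lam=None, mu=None, iters=10):
    """Decompose W ~ L + S (nuclear-norm + L1 penalties) with a basic ADMM loop."""
    m, n = W.shape
    lam = lam if lam is not None else 1.0 / (max(m, n) ** 0.5)        # standard RPCA weight
    mu = mu if mu is not None else 1.25 / torch.linalg.matrix_norm(W, ord=2)
    L, S, Y = torch.zeros_like(W), torch.zeros_like(W), torch.zeros_like(W)
    for _ in range(iters):
        # Low-rank step: singular-value thresholding of W - S + Y/mu
        U, sig, Vh = torch.linalg.svd(W - S + Y / mu, full_matrices=False)
        sig = torch.clamp(sig - 1.0 / mu, min=0.0)
        L = U @ torch.diag(sig) @ Vh
        # Sparse step: elementwise soft-thresholding of W - L + Y/mu
        T = W - L + Y / mu
        S = torch.sign(T) * torch.clamp(T.abs() - lam / mu, min=0.0)
        # Dual update on the reconstruction residual
        Y = Y + mu * (W - L - S)
    return L, S

class ProbabilisticPruner(torch.nn.Module):
    """Bernoulli retention probabilities over singular values of L and nonzeros of S."""
    def __init__(self, n_singular, n_sparse):
        super().__init__()
        self.logit_L = torch.nn.Parameter(torch.zeros(n_singular))
        self.logit_S = torch.nn.Parameter(torch.zeros(n_sparse))

    def sample(self):
        pL, pS = torch.sigmoid(self.logit_L), torch.sigmoid(self.logit_S)
        mL, mS = torch.bernoulli(pL.detach()), torch.bernoulli(pS.detach())
        # Log-probability of the sampled masks; multiplying it by a reward
        # (e.g., negative calibration loss minus a budget penalty) and calling
        # backward() gives the REINFORCE gradient for the retention logits.
        logp = (torch.distributions.Bernoulli(pL).log_prob(mL).sum()
                + torch.distributions.Bernoulli(pS).log_prob(mS).sum())
        return mL, mS, logp

# Usage on a toy matrix: decompose, then sample retention masks.
W = torch.randn(256, 512)
L, S = rpca_admm(W)
rank, nnz = int((torch.linalg.svdvals(L) > 1e-6).sum()), int((S != 0).sum())
pruner = ProbabilisticPruner(rank, nnz)
mL, mS, logp = pruner.sample()   # REINFORCE surrogate loss: -(reward) * logp
```

In this sketch, the final compressed layer would keep only the singular triplets with mL = 1 (stored as two thin factors) and the S entries with mS = 1, mirroring the reconstruction described above.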
Experiment
The experiments evaluate the proposed method on LLaMA (7B, 13B, 30B), LLaMA-2 (7B, 13B), BERT-base, and DeBERTaV3-base, reporting perplexity (PPL) on WikiText and zero-shot accuracy on six commonsense benchmarks (e.g., PIQA, BoolQ). The setup uses a 50% compression ratio as the primary evaluation point, with a calibration set of 128 sequences from C4, and compares against baselines such as SparseGPT, WANDA, BESA, LPAF, and LoSparse. The method outperforms these baselines in both PPL and zero-shot accuracy at 50% compression. Ablation studies show that only a few RPCA iterations are needed for an effective decomposition, and that heuristic threshold-based pruning causes severe performance drops, validating the probabilistic approach. However, the main text lacks detailed analysis of higher compression ratios (relegated to the appendices), and an unexpected performance boost when compressing the last layer of LLaMA-2-7B suggests possible evaluation artifacts. While the setup is reasonable for initial validation, it is not comprehensive enough to assess scalability across diverse architectures or extreme compression scenarios, and the focus on LLaMA-family models limits generalizability claims.
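To ground the 50% operating point, here is a quick back-of-the-envelope sketch of how the retained parameters of one compressed matrix can be counted; the helper name, the example rank and sparsity values, and the choice of whether sparse index overhead is charged against the budget are illustrative assumptions, not the paper's exact accounting.

```python
def compressed_param_count(m, n, rank, nnz, count_indices=False):
    """Parameters kept when an m x n matrix is stored as U_r @ V_r^T plus a masked sparse S."""
    low_rank = m * rank + rank * n               # two thin factors from the retained singular values
    sparse = nnz * (2 if count_indices else 1)   # optionally charge index storage (COO-style)
    return low_rank + sparse

# Worked example for one 4096 x 4096 projection (LLaMA-7B hidden size):
m = n = 4096
dense = m * n                                    # 16,777,216 parameters
kept = compressed_param_count(m, n, rank=768, nnz=2_097_152)
print(f"kept {kept:,} of {dense:,} -> {kept / dense:.0%} of the dense size")
# kept 8,388,608 of 16,777,216 -> 50% of the dense size; the probabilistic
# stage is what allocates this rank/nnz budget differently per layer.
```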
Further Thoughts
While the proposed method shows promise in automating LLM compression, I believe its integration with other techniques such as quantization or knowledge distillation, as the authors suggest, could be a fruitful direction. For instance, combining this approach with post-training quantization could further reduce model size without significant performance loss, especially for edge-device deployment. Additionally, exploring the method's applicability to emerging foundation models beyond LLaMA, such as vision-language models or state-space models, could reveal new challenges and opportunities, particularly in handling multimodal weight structures. Another insightful connection is to dynamic sparse training (DST), where sparsity evolves during training; adapting the probabilistic pruning stage to a dynamic training context might mitigate the performance collapse observed at high compression ratios. Lastly, the unexpected performance boost when compressing the last layer of LLaMA-2-7B prompts a deeper investigation into layer-specific redundancy patterns: could it indicate that certain layers encode more noise than information, and if so, how can this insight inform pre-training strategies that inherently reduce redundancy? These questions highlight the potential for this work to inspire broader research into adaptive compression frameworks.