arXiv: 2505.03803

RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization


RWKVQuant introduces a tailored Post Training Quantization framework for RWKV models, using a coarse-to-fine proxy to hybridize scalar and vector quantization and optimizing codebooks for element-wise operations, achieving ~3-bit quantization with minimal accuracy loss and significant memory and speed improvements.

Large Language Model, Efficiency, Pre-training, Multimodality

Chen Xu, Yuxuan Yue, Zukang Xu, Xing Hu, Jiangyong Yu, Zhixuan Chen, Sifan Zhou, Zhihang Yuan, Dawei Yang

Houmo AI, Harbin Institute of Technology (Shenzhen)

Generated by grok-3

Background Problem

RWKV, a modern sequence model blending Recurrent Neural Networks (RNNs) and Transformer capabilities, offers comparable performance to Transformer-based Large Language Models (T-LLMs) with efficient inference, making it a promising architecture for language and vision tasks. However, its large parameter size poses significant challenges for deployment on resource-constrained devices due to high memory demands and low compute-to-memory access ratios. While Post Training Quantization (PTQ) is a common technique to reduce model size and inference latency in T-LLMs, applying it to RWKV results in severe performance degradation due to two key issues: (1) non-linear operators in RWKV’s structure hinder parameter fusion in scalar quantization (SQ), increasing computational overhead, and (2) a higher proportion of uniformly distributed weights complicates cluster-based vector quantization (VQ), leading to accuracy loss. This paper aims to address these challenges by developing a tailored PTQ framework for RWKV models to enable efficient deployment without sacrificing performance.

Method

RWKVQuant is a Post Training Quantization (PTQ) framework designed specifically for RWKV models. It combines a hybrid of Scalar Quantization (SQ) and Vector Quantization (VQ), guided by a novel coarse-to-fine proxy, with a VQ codebook optimized for RWKV's element-wise multiplications. The core idea is to dynamically select the most suitable quantization method for each weight based on its distribution. The implementation has two main components. (1) A coarse-to-fine proxy: Information Entropy (IE) serves as the coarse-grained measure and routes clearly non-uniform weights to VQ; for weights that look uniform overall, a fine-grained proxy based on high-order central moments detects local outliers, choosing SQ for uniform, outlier-free weights and VQ otherwise. This hybrid selection reduces the cost of assigning a quantizer per layer from O(2^M) for exhaustive search to O(M). (2) A codebook optimization for element-wise multiplication, an operation characteristic of RWKV, which uses weighted K-Means guided by squared activation values together with percentile-based clipping to handle activation outliers, improving quantization accuracy. The key steps are weight transformation, proxy computation, quantization-method selection, and tailored VQ codebook generation, keeping performance loss minimal at low bit-widths (~3-bit).
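To make the coarse-to-fine proxy concrete, here is a minimal sketch of how the SQ/VQ decision could look in code. The histogram-based entropy estimate, the choice of a fourth-order (kurtosis-like) central moment, and the threshold names `tau_c` and `tau_f` are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def coarse_entropy(weights: np.ndarray, num_bins: int = 256) -> float:
    """Coarse proxy (assumed form): information entropy of the weight histogram.
    Higher entropy means the weights are closer to uniformly distributed."""
    hist, _ = np.histogram(weights, bins=num_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def fine_outlier_score(weights: np.ndarray, order: int = 4) -> float:
    """Fine proxy (assumed form): normalized high-order central moment,
    which grows when a mostly-uniform weight tensor hides local outliers."""
    centered = weights - weights.mean()
    moment = np.mean(centered ** order)
    return float(moment / (weights.std() ** order + 1e-12))

def select_quantizer(weights: np.ndarray, tau_c: float, tau_f: float) -> str:
    """Hybrid selection: VQ for non-uniform weights, SQ for uniform weights
    without outliers, VQ again when outliers are detected."""
    if coarse_entropy(weights) < tau_c:      # low entropy: clearly non-uniform, cluster-friendly
        return "VQ"
    if fine_outlier_score(weights) > tau_f:  # uniform overall, but local outliers present
        return "VQ"
    return "SQ"                              # uniform and outlier-free
```

Because each weight tensor is scored once by cheap statistics instead of trying every SQ/VQ combination across M layers, the selection pass scales linearly in the number of layers.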
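The codebook optimization for element-wise multiplication can likewise be sketched with off-the-shelf tools: a weighted K-Means in which each weight's contribution is scaled by its clipped, squared activation statistic. The use of scikit-learn's `sample_weight`, the 99th-percentile clip, the 256-entry codebook, and the scalar (d=1) clustering are simplifying assumptions; the actual method clusters sub-vectors and may differ in detail.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(weights: np.ndarray,
                   activations: np.ndarray,
                   num_centroids: int = 256,
                   clip_percentile: float = 99.0) -> np.ndarray:
    """Illustrative weighted K-Means codebook for element-wise-multiply weights.
    Each weight is emphasized by its squared activation magnitude (a per-element
    calibration statistic), with percentile clipping so activation outliers
    do not dominate the clustering objective."""
    importance = activations.astype(np.float64) ** 2
    cap = np.percentile(importance, clip_percentile)
    importance = np.clip(importance, None, cap)           # tame activation outliers

    # Scalar (1-D) clustering for simplicity; real VQ groups weights into sub-vectors.
    km = KMeans(n_clusters=num_centroids, n_init=10, random_state=0)
    km.fit(weights.reshape(-1, 1), sample_weight=importance.reshape(-1))
    return km.cluster_centers_.reshape(-1)                # the quantization codebook
```

The design intuition is that for an element-wise product w ⊙ x, errors on weights paired with large activations are amplified, so those weights should sit closer to a codebook centroid.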

Experiment

The experiments evaluate RWKVQuant on various RWKV models (RWKV6, RWKV7, VR-WKV) across language tasks (LAMBADA and nine zero-shot benchmarks) and vision tasks (ImageNet, COCO, ADE20K). The setup compares RWKVQuant against SQ methods (RTN, GPTQ, AWQ, QuaRot) and VQ methods (K-Means, GPTVQ, VPTQ) at bit-per-weight (bpw) settings of 3.25 and 3.5, with calibration using 128 samples per task. Results show RWKVQuant consistently outperforms the baselines, achieving less than 1% accuracy loss on RWKV6-14B at ~3-bit, with 2.83x memory savings and 2.14x speedup on an NVIDIA A6000 GPU. On language tasks it yields smaller perplexity increases and maintains high zero-shot accuracy; on vision tasks it achieves top scores in classification and segmentation. The experimental design is comprehensive, covering multiple model sizes and tasks, and the hybrid strategy's effectiveness is validated through ablation studies on proxy selection and codebook optimization. However, the threshold settings (τ^c and τ^f) appear model-specific, raising concerns about generalizability, and the baseline comparisons might not fully account for potential RWKV-specific optimizations of the other methods. Overall, the results align with expectations of improved quantization performance, though robustness across diverse scenarios needs further validation.

Further Thoughts

The RWKVQuant framework presents a compelling approach to quantization for RWKV models, particularly with its hybrid strategy that adapts to weight distribution characteristics. However, an insightful extension could explore how this coarse-to-fine proxy generalizes to other non-Transformer architectures with similar non-linear constraints, such as state-space models, which also face quantization challenges due to unique structural properties. Additionally, the uniform weight distribution issue in RWKV raises a broader question about whether pre-training strategies or architectural modifications could inherently reduce uniformity to favor VQ naturally, potentially complementing post-training methods like RWKVQuant. Linking to recent works on parameter-efficient fine-tuning, such as Low-Rank Adaptation (LoRA), one might investigate if integrating such techniques before quantization could further minimize performance loss by adjusting weight distributions pre-emptively. Finally, the codebook optimization for element-wise multiplication could inspire similar tailored approaches in multimodal systems where element-wise operations are prevalent, potentially bridging efficiency gains across domains like vision-language models. These connections highlight RWKVQuant’s potential as a stepping stone for broader quantization research in diverse AI architectures.
