arXiv:2505.08823

An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

Published:  at  11:01 AM

This paper demonstrates that fine-tuning large language models to 1.58-bit ternary weights using extra RMSNorm layers and a gradual quantization schedule achieves lower cross-entropy loss than standard quantization-aware training baselines and preserves reasoning performance, enabling deployment on commodity hardware without relying on complex knowledge distillation.

Large Language Model, Fine-tuning, Efficiency, Transformer

Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock

Milwaukee School of Engineering, MSOE Artificial Intelligence Club

Generated by grok-3

Background Problem

Large language models (LLMs) are computationally expensive to deploy due to their massive size, prompting research into quantization techniques to reduce memory and compute requirements. Post-training quantization (PTQ) is fast but often degrades accuracy, while quantization-aware training (QAT) offers better performance at the cost of additional training. Ternary quantization (1.58-bit) promises significant model compression but poses training stability challenges. Building on prior work like BitNet, this paper addresses the problem of fine-tuning ternary LLMs to maintain performance close to full-precision models, with a focus on enabling deployment on commodity hardware by drastically reducing VRAM usage.
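
A quick back-of-the-envelope check on the "1.58-bit" naming and the memory savings may help; this is a rough estimate for weights only (the 8B parameter count matches the Llama3-8B setup used in the experiments) and ignores packing overhead, activations, and KV cache:

```python
import math

# A ternary weight takes one of three values {-1, 0, +1},
# so its information content is log2(3) ~= 1.585 bits -- hence "1.58-bit".
bits_per_weight = math.log2(3)
print(f"{bits_per_weight:.3f} bits per ternary weight")

# Rough weight-memory estimate for an 8B-parameter model.
params = 8e9
fp16_gb = params * 16 / 8 / 1e9                  # ~16 GB at 16 bits/weight
ternary_gb = params * bits_per_weight / 8 / 1e9  # ~1.6 GB at ~1.58 bits/weight
print(f"fp16: {fp16_gb:.1f} GB, ternary: {ternary_gb:.1f} GB")
```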

Method

The proposed method fine-tunes pre-trained Transformer-based LLMs to ternary weights (-1, 0, +1) using a custom BitLinear layer with a Straight-Through Estimator (STE) for gradient computation. Key components include: (1) replacing dense layers with BitLinear layers that apply ‘fake quantization’ to weights during the forward pass, normalizing inputs with RMSNorm to ensure scale consistency; (2) a gradual quantization schedule via a lambda parameter, transitioning from full-precision to ternary weights over training steps to avoid destabilizing loss spikes; (3) insertion of extra RMSNorm layers before each quantized linear layer to maintain consistent input distributions, which the paper identifies as critical for training stability; and (4) optional layer-wise knowledge distillation (KD) to align student (quantized) and teacher (full-precision) activations, though the authors prioritize direct fine-tuning with RMSNorm over KD due to better results. The method aims to ensure stable convergence and minimal performance loss at ultra-low precision.
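
A minimal PyTorch sketch of what such a BitLinear layer could look like follows; it illustrates the ideas above and is not the authors' code. The absmean scaling inside `ternarize` follows the BitNet b1.58 recipe, `lam` is the gradual-quantization blend parameter, and `nn.RMSNorm` assumes PyTorch ≥ 2.4 (an older version would need a hand-rolled RMSNorm).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Module):
    """Linear layer with ternary 'fake quantization' of the weights,
    an extra RMSNorm on the input, and a lambda-blended transition
    from full-precision to ternary weights (sketch, not the paper's code)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=0.02)
        # Extra RMSNorm keeps the input distribution consistent before the
        # quantized matmul -- the ingredient the paper highlights for stability.
        self.norm = nn.RMSNorm(in_features)

    @staticmethod
    def ternarize(w: torch.Tensor) -> torch.Tensor:
        # Absmean scaling, then rounding to {-1, 0, +1}, rescaled back.
        scale = w.abs().mean().clamp(min=1e-5)
        return (w / scale).round().clamp(-1, 1) * scale

    def forward(self, x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
        x = self.norm(x)
        w_q = self.ternarize(self.weight)
        # Straight-through estimator: quantized weights in the forward pass,
        # gradients flow to self.weight as if no rounding had happened.
        w_ste = self.weight + (w_q - self.weight).detach()
        # Gradual quantization: lam ramps from 0 (full precision) to 1 (ternary).
        w = (1.0 - lam) * self.weight + lam * w_ste
        return F.linear(x, w)
```

During fine-tuning, `lam` would be ramped over training steps, for example `lam = min(1.0, step / ramp_steps)`, so the model never sees an abrupt jump from full-precision to ternary weights.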

Experiment

The experiments fine-tuned two Transformer-based LLMs, Qwen-1.5B and Llama3-8B, on the OpenThoughts-114k dataset for next-token prediction, updating all weights to ternary values. Three setups were compared: baseline QAT (no KD or extra norms), QAT with layer-wise KD, and the proposed direct QAT with extra RMSNorm layers, all using the same gradual quantization schedule. Results showed the proposed method achieved the lowest cross-entropy loss, outperforming both baseline and KD approaches, with ablation studies confirming that removing extra RMSNorm layers led to unstable training or higher loss. Downstream evaluation on mathematical reasoning benchmarks (AIME-2024 and MATH-500) indicated negligible accuracy drops compared to full-precision baselines, suggesting preserved reasoning capabilities. However, the experimental setup lacks comparison with other state-of-the-art quantization methods beyond BitNet, and the benchmarks’ representativeness is unclear. While the results align with expectations of minimal performance degradation, the limited scope of models and tasks tested raises questions about generalizability.
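
For reference, the layer-wise KD baseline that the proposed method outperforms typically amounts to matching the quantized student's hidden states to the full-precision teacher's, layer by layer. A minimal sketch is below; the exact loss used in the paper is an assumption here:

```python
import torch.nn.functional as F

def layerwise_kd_loss(student_hidden, teacher_hidden):
    """Sum of MSE losses between corresponding student (quantized) and
    teacher (full-precision) hidden states, one term per Transformer layer."""
    return sum(
        F.mse_loss(s, t.detach())  # teacher is frozen, so stop its gradients
        for s, t in zip(student_hidden, teacher_hidden)
    )
```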

Further Thoughts

While the addition of extra RMSNorm layers appears to be a simple yet effective solution for stabilizing ternary quantization, it prompts deeper questions about why input distribution consistency is so critical at ultra-low precision and whether this insight could apply to other low-bit quantization schemes or even non-Transformer architectures. The dismissal of knowledge distillation as less effective is surprising given its prominence in prior quantization literature; this could be explored further by testing hybrid approaches that combine RMSNorm with selective KD for specific layers or tasks. Additionally, the practical implication of fitting large models on commodity hardware connects to broader trends in democratizing AI, but it would be insightful to investigate whether this method introduces latency or inference overheads that could offset memory gains in real-world deployments. Relating this to other areas, such as federated learning, one might consider whether ternary quantization with RMSNorm could reduce communication costs in distributed training environments, opening new research avenues for efficient, privacy-preserving LLM fine-tuning.


