LENSLLM introduces a Hessian-based PAC-Bayes framework and an NTK-based scaling model for LLM selection, achieving up to 91.1% selection accuracy (RelAcc) and up to 88.5% reduction in computational cost by modeling fine-tuning dynamics across diverse tasks.
Large Language Model, Fine-tuning, Scaling Laws, Efficiency, Transformer, Prediction
Xinyue Zeng, Haohui Wang, Junhong Lin, Jun Wu, Tyler Cody, Dawei Zhou
Virginia Tech, Massachusetts Institute of Technology, Michigan State University
Generated by grok-3
Background Problem
The rapid increase in open-source Large Language Models (LLMs) and the diversity of downstream tasks have created a pressing need for efficient model selection, as fine-tuning all candidate models is computationally infeasible. Traditional model selection methods, designed for smaller-scale models, fail to generalize to LLMs due to high computational costs and poor performance in novel or out-of-distribution scenarios. This paper addresses two key challenges: the lack of theoretical understanding of fine-tuning dynamics (especially in low-data regimes) and the need for an accurate, efficient LLM selection framework that balances performance and computational cost across diverse tasks.
Method
The proposed method, LENSLLM, is a framework for LLM selection that models fine-tuning dynamics through a two-pronged approach. First, it introduces a Hessian-based PAC-Bayes generalization bound that theoretically characterizes the transition between the pre-power phase (initial fine-tuning with slow improvement) and the power phase (stable, predictable performance scaling), using the Hessian matrix to capture parameter sensitivity and generalization behavior. Second, it develops an NTK-based Rectified Scaling Model that integrates Neural Tangent Kernel (NTK) approximations with scaling laws to predict test loss as a function of fine-tuning dataset size; the rectified loss augments the standard power law with a pre-learned data term that captures pre-trained knowledge and transformer learning dynamics. The implementation is an iterative algorithm that trains candidate models on progressively smaller datasets, fits a regression estimator for performance prediction, and selects the model with the best predicted score, optimizing for both accuracy and efficiency, as sketched below.
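To make the fit-and-extrapolate selection loop concrete, here is a minimal Python sketch. It assumes the rectified scaling-law form L(D) = B / (D_l + D^beta) + E (with D_l playing the role of the pre-learned data term) and a hypothetical `fine_tune_and_eval(model, n)` hook that fine-tunes a candidate on n samples and returns its test loss; this is an illustration of the idea rather than the authors' implementation, which additionally incorporates NTK-based corrections.

```python
# Sketch of the progressive-training selection loop: fit a rectified
# scaling curve on small-data fine-tuning losses, extrapolate to the
# full dataset size, and pick the candidate with the lowest predicted loss.
import numpy as np
from scipy.optimize import curve_fit

def rectified_scaling_law(D, B, D_l, beta, E):
    """Assumed loss form: L(D) = B / (D_l + D**beta) + E."""
    return B / (D_l + D**beta) + E

def predict_full_data_loss(model, subset_sizes, full_size, fine_tune_and_eval):
    """Fit the scaling curve on small subsets, extrapolate to full_size."""
    sizes = np.array(subset_sizes, dtype=float)
    losses = np.array([fine_tune_and_eval(model, int(n)) for n in sizes])
    params, _ = curve_fit(
        rectified_scaling_law, sizes, losses,
        p0=[1.0, 1.0, 0.5, losses.min()], maxfev=20000,
    )
    return rectified_scaling_law(float(full_size), *params)

def select_model(candidates, subset_sizes, full_size, fine_tune_and_eval):
    """Return the candidate with the lowest extrapolated test loss."""
    scores = {m: predict_full_data_loss(m, subset_sizes, full_size, fine_tune_and_eval)
              for m in candidates}
    return min(scores, key=scores.get), scores

# Hypothetical usage:
# best, scores = select_model(["opt-1.3b", "t5-base", "gpt2-large"],
#                             subset_sizes=[200, 800, 3200, 12800],
#                             full_size=1_638_400,
#                             fine_tune_and_eval=my_fine_tune_and_eval)
```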
Experiment
The experiments evaluate LENSLLM on three benchmarks (FLAN, Wikitext, Gigaword) with dataset sizes ranging from 200 to 1,638,400 samples and multiple LLM architectures (e.g., OPT, T5, GPT-2). The setup compares LENSLLM against five baselines (Rectified Scaling Law, NLPmetrics, SubTuning, ZeroShot, ModelSize) using Pearson Correlation (PearCorr) and Relative Accuracy (RelAcc) as metrics. LENSLLM achieves up to 91.1% RelAcc and 85.8% PearCorr, significantly outperforming the baselines, and its test-loss prediction RMSE is 3-5 times lower than that of the best competitor. In terms of efficiency, it reduces computational cost, measured in FLOPs, by up to 88.5% compared to FullTuning. However, the experimental design may lack task diversity beyond the chosen benchmarks, and the setup does not fully address robustness in out-of-distribution scenarios. While the results match the expectation of improved accuracy and efficiency, the comprehensiveness of the evaluation could be questioned due to potential cherry-picking of favorable architectures and datasets.
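For reference, the two reported selection metrics can be computed as illustrated below, under the common conventions that PearCorr is the Pearson correlation between predicted and actual fine-tuned scores across candidates and RelAcc is the selected model's actual score relative to the best candidate's; the exact protocol in the paper may differ, and the numbers here are placeholders.

```python
# Illustrative computation of PearCorr and RelAcc for a model-selection method.
import numpy as np
from scipy.stats import pearsonr

predicted = np.array([0.62, 0.71, 0.58, 0.80])  # predicted scores, one per candidate
actual    = np.array([0.60, 0.74, 0.55, 0.78])  # true fine-tuned scores, one per candidate

pear_corr, _ = pearsonr(predicted, actual)        # agreement between predictions and reality
rel_acc = actual[np.argmax(predicted)] / actual.max()  # how close the pick is to the optimum

print(f"PearCorr = {pear_corr:.3f}, RelAcc = {rel_acc:.1%}")
```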
Further Thoughts
While LENSLLM presents a compelling theoretical and practical approach to LLM selection, I believe its reliance on NTK approximations warrants further scrutiny, especially for extremely large models where NTK’s infinite-width assumptions may not hold. An interesting connection could be drawn to recent works on kernel methods for transformers, which suggest that NTK might oversimplify complex attention mechanisms—future research could explore hybrid approaches combining NTK with attention-specific metrics. Additionally, the phase transition theory could be extended to multi-task learning scenarios, where fine-tuning dynamics might exhibit more complex behaviors due to task interference; this could tie into studies on catastrophic forgetting in continual learning. Lastly, the environmental impact of even ‘efficient’ LLM selection processes remains a concern, and integrating energy consumption metrics into the efficiency analysis could align this work with broader responsible AI initiatives, potentially inspiring a new line of research on sustainable model selection frameworks.