arXiv: 2505.01449

COSMOS: Predictable and Cost-Effective Adaptation of LLMs

Published at 11:05 AM

COSMOS introduces a cost-effective framework that predicts both the performance and the cost of LLM adaptation strategies such as QLoRA fine-tuning and retrieval-augmented ICL, achieving a 1.09% mean absolute prediction error while reducing computational cost by 92.72% on average across eight diverse benchmarks.

Large Language Model, Fine-tuning, In-Context Learning, Efficiency, Prediction, Cost Analysis

Jiayu Wang, Aws Albarghouthi, Frederic Sala

University of Wisconsin-Madison

Generated by grok-3

Background Problem

With over 1.6 million models hosted on platforms like Hugging Face and a growing menu of adaptation strategies, selecting the optimal model and strategy combination for a downstream task under resource constraints has become a significant challenge. The paper frames this as the strategy selection problem: balancing performance against computational cost without resorting to expensive exhaustive experimentation. The goal is to predict both the performance and the cost of adaptation strategies efficiently, reducing the computational overhead of LLM deployment while maintaining high performance standards.

Method

COSMOS (COSt-effective MOdel–Strategy prediction) is a unified framework designed to predict the performance and cost of LLM adaptation strategies without extensive trials. Its core idea is to use lightweight, strategy-specific predictors to estimate outcomes for different model-strategy combinations. For training-time adaptation like QLoRA fine-tuning, COSMOS employs an embedding-augmented linear proxy model that uses frozen embeddings from the base model and a task-specific linear projector, calibrated on a small validation set to predict fine-tuning gains. For test-time adaptation like retrieval-augmented in-context learning (ICL), it leverages scaling laws to fit an exponential saturation curve based on sparse early measurements, predicting performance with varying demonstration counts. The framework also includes a detailed cost analysis model, accounting for prediction, adaptation, and evaluation costs using both computing-based and token-based methods, ensuring cost-efficiency by minimizing the total computational overhead compared to exhaustive search.
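The paper's exact functional form and constants are not reproduced here, but the test-time predictor can be illustrated with a minimal sketch: assuming ICL performance follows an exponential saturation curve p(k) = a - b·exp(-c·k) in the number of demonstrations k, we fit (a, b, c) to a few sparse early measurements and extrapolate to larger shot counts without running them. All numbers below are hypothetical.

```python
import numpy as np

def fit_saturation(k_obs, acc_obs):
    """Fit p(k) = a - b * exp(-c * k) by a coarse grid search over (a, c),
    solving for b in closed form (least squares) at each grid point.
    This is a simple stand-in for whatever fitting routine the paper uses."""
    best = None
    for a in np.linspace(acc_obs.max(), acc_obs.max() + 0.3, 61):
        for c in np.linspace(0.05, 3.0, 60):
            e = np.exp(-c * k_obs)
            # Least-squares solution for b given fixed a and c.
            b = np.dot(a - acc_obs, e) / np.dot(e, e)
            resid = acc_obs - (a - b * e)
            sse = float(np.dot(resid, resid))
            if best is None or sse < best[0]:
                best = (sse, float(a), float(b), float(c))
    return best[1], best[2], best[3]

# Hypothetical sparse early measurements: accuracy at k = 0, 1, 2, 4 shots.
k_obs = np.array([0.0, 1.0, 2.0, 4.0])
acc_obs = np.array([0.52, 0.61, 0.66, 0.70])

a, b, c = fit_saturation(k_obs, acc_obs)

# Extrapolate to 16 demonstrations without actually running them;
# the prediction approaches the asymptote a as k grows.
pred_16 = a - b * np.exp(-c * 16)
```

The closed-form solve for b keeps the search two-dimensional, which is cheap enough that a handful of early measurements suffices, matching the framework's emphasis on sparse sampling.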

Experiment

The experiments evaluate COSMOS on eight diverse benchmarks spanning general tasks (e.g., MMLU, Winogrande) and financial-domain tasks (e.g., FPB, FiQA-SA), using two models: Gemma 2B and Llama 3 8B. The setup covers 55 strategy combinations of QLoRA fine-tuning and retrieval-augmented ICL under low, medium, and high cost regimes, systematically varying configurations such as training iterations, data portions, and demonstration counts. COSMOS achieves high prediction accuracy, with a mean absolute error (MAE) of 1.09%, and reduces computational cost by 92.72% on average, with up to 98.71% savings in resource-intensive scenarios. The experimental design is comprehensive in task diversity and cost bands, and the cost reduction is substantial, in line with the efficiency claims. However, the reliance on only two models limits generalizability, and the MAE, though low, may still matter for precision-critical applications. The setup also effectively demonstrates strategy-specific prediction strengths and the trade-offs between training-time and test-time adaptation, though hybrid strategies remain underexplored.
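To make the cost-savings accounting concrete, here is a minimal sketch with hypothetical GPU-hour figures (not the paper's actual numbers), assuming the predict-then-validate pattern the framework implies: pay a small prediction cost for all candidate combinations, then actually run only the top-k predicted ones instead of all 55.

```python
# Hypothetical cost model: every one of the 55 strategy combinations
# would cost 10 GPU-hours to evaluate exhaustively.
PER_COMBO_COST = 10.0
exhaustive_cost = 55 * PER_COMBO_COST

# COSMOS (hypothetical figures): cheap predictors score all combinations,
# then only the top-k predicted combinations are actually run.
predict_cost = 0.5      # lightweight proxy predictors across all combos
validate_top_k = 3      # assumed number of combos validated for real

cosmos_cost = predict_cost + validate_top_k * PER_COMBO_COST
savings = 100 * (1 - cosmos_cost / exhaustive_cost)
```

With these made-up numbers the savings come out in the same ballpark as the paper's reported average (92.72%), illustrating why avoiding exhaustive runs dominates the cost equation even after accounting for the predictors themselves.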

Further Thoughts

The COSMOS framework presents a promising direction for cost-effective LLM deployment, particularly in industrial settings where computational budgets are a critical concern. However, its reliance on specific predictors raises questions about adaptability to emerging adaptation strategies or novel model architectures beyond the tested Gemma 2B and Llama 3 8B. An insightful connection could be drawn to recent works on AutoML and Neural Architecture Search (NAS), where automated hyperparameter optimization might complement COSMOS by dynamically refining predictor models. Additionally, integrating environmental cost metrics, such as carbon footprint, could align COSMOS with broader AI ethics and sustainability goals, an area gaining traction in AI for Science and Responsible AI research. Another avenue for exploration is the potential of COSMOS in federated learning scenarios, where adaptation strategies must account for distributed data and privacy constraints—could lightweight predictors be adapted for such decentralized settings? These considerations could significantly enhance the framework’s applicability and impact in real-world deployments.


