arXiv: 2504.00762

Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

Published:  at  10:59 AM

This paper introduces ModelSwitch, a multi-LLM repeated-sampling strategy that leverages answer consistency to dynamically switch between models, achieving superior performance while using 34% fewer samples on average than single-LLM self-consistency across diverse datasets.

Large Language Model, Multi-Agent, Efficiency, Reasoning, Test Time

Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Hangfan Zhang, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, Shuyue Hu

State Key Laboratory for Novel Software Technology, Nanjing University, Shanghai Artificial Intelligence Laboratory, The University of Auckland, The Pennsylvania State University

Generated by grok-3

Background Problem

The rapid advancement of large language models (LLMs) has been driven by scaling compute, but scaling training-time compute is reaching a plateau, making inference-time compute a promising alternative. A key challenge in inference-time scaling is the high computational cost and latency associated with repeated sampling to improve answer correctness, often requiring hundreds or thousands of samples. This paper addresses the problem of sample efficiency by questioning whether such extensive sampling is necessary and proposes a method to achieve high performance with fewer samples by leveraging multiple LLMs to exploit their complementary strengths.
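For reference, the repeated-sampling baseline that the paper targets is single-LLM self-consistency: sample the same model many times and majority-vote the answers. The sketch below is a minimal illustration of that baseline, where `generate` is a hypothetical callable standing in for one LLM API call that returns a parsed final answer.

```python
from collections import Counter

def self_consistency(generate, question, k=16):
    """Baseline: sample one LLM k times and majority-vote the answers.

    `generate` is a hypothetical callable that queries a single LLM once
    (e.g., with temperature > 0) and returns a parsed final answer string.
    """
    answers = [generate(question) for _ in range(k)]
    votes = Counter(answers)
    answer, count = votes.most_common(1)[0]
    return answer, count / k  # answer plus its consistency (vote share)
```

Accuracy improves as k grows, but the compute cost grows linearly with it, which is exactly the inefficiency ModelSwitch targets.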

Method

The proposed method, ModelSwitch, builds on the repeated-sampling-then-voting framework with a novel multi-LLM approach. Its core idea is to dynamically switch between multiple LLMs during sampling based on the consistency of their generated answers, as consistency correlates strongly with accuracy. The implementation involves: (1) selecting a diverse set of LLMs to maximize complementary capabilities; (2) allocating a fixed sampling budget equally across models (e.g., for n models and budget K, each samples K/n times); (3) sampling sequentially from each model and stopping if high consistency is achieved (indicating likely correctness), thus saving compute; and (4) aggregating answers using a weighted voting algorithm that considers internal weights (based on answer consistency via entropy) and external weights (based on prior model performance). This method aims to reduce unnecessary sampling by switching to another model when inconsistency suggests low confidence, thereby enhancing both efficiency and performance.
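The following is a minimal sketch of this loop under simplifying assumptions: each model is exposed as a callable returning a parsed answer string, external weights are given as fixed priors, and the consistency threshold and entropy-based internal weight are illustrative formulas rather than the paper's exact settings.

```python
import math
from collections import Counter

def answer_distribution(answers):
    """Empirical distribution over the distinct answers one model produced."""
    counts = Counter(answers)
    total = len(answers)
    return {a: c / total for a, c in counts.items()}

def internal_weight(dist):
    """Consistency-based weight: 1.0 for a fully consistent model, decreasing
    toward 0 as the normalized entropy of its answers grows (illustrative)."""
    if len(dist) == 1:
        return 1.0
    entropy = -sum(p * math.log(p) for p in dist.values())
    return 1.0 - entropy / math.log(len(dist))

def model_switch(models, external_weights, question, budget=16, threshold=1.0):
    """Sample models sequentially; stop early once one is consistent enough.

    models: list of callables, each querying one LLM and returning an answer.
    external_weights: prior-performance weight per model (assumed given).
    """
    per_model = budget // len(models)          # split the budget equally
    scores = Counter()
    for model, w_ext in zip(models, external_weights):
        answers = [model(question) for _ in range(per_model)]
        dist = answer_distribution(answers)
        w_int = internal_weight(dist)
        top_answer, top_p = max(dist.items(), key=lambda kv: kv[1])
        if top_p >= threshold:                 # highly consistent: accept and
            return top_answer                  # stop, saving remaining budget
        for a, p in dist.items():              # otherwise accumulate weighted
            scores[a] += w_int * w_ext * p     # votes and switch models
    return scores.most_common(1)[0][0]
```

With the default threshold of 1.0, switching happens whenever a model is not perfectly self-consistent; relaxing the threshold trades a little accuracy for further sample savings.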

Experiment

The experiments were conducted on seven diverse datasets (GSM8K, MATH, MathBench, MGSM, DATE, MMLU-Pro, AIME24) covering reasoning, knowledge, and domain-specific tasks, using a mix of lightweight closed-source LLMs (e.g., GPT-4o mini, Gemini 1.5 Flash) and open-source models (e.g., Llama-3.1-8B-Instruct). The setup compared ModelSwitch against single-LLM self-consistency and multi-agent debate methods under fixed sampling budgets (e.g., 15 or 16 samples). Results showed ModelSwitch outperforming self-consistency in efficacy (e.g., a 7-point boost on MathBench versus 2.6 points for the best single LLM) and efficiency (a 34% reduction in samples on average across six datasets). It also surpassed advanced LLMs such as GPT-4o with fewer samples and achieved state-of-the-art performance on four datasets, notably a 10.2-point gain on MMLU-Pro over the best single LLM. Scalability tests indicated optimal performance with just a few comparable LLMs, and robustness to model order was demonstrated. Integrating a reward model (Qwen2.5-MATH-RM-72B) further boosted accuracy (e.g., from 80% to 84% on MATH). While the experimental design is comprehensive, the selection of models and datasets might favor the method, and failure cases (e.g., consistent but wrong answers) are underexplored. Overall, the results align with expectations of improved efficiency and performance, though real-world cost implications warrant further scrutiny.
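The paper does not spell out the reward-model integration here, but one plausible reading (an assumption, not necessarily the authors' exact recipe) is best-of-N selection: score each candidate answer gathered by ModelSwitch with the reward model and keep the highest-scoring one. The `score` callable below is a hypothetical wrapper around a served reward model such as Qwen2.5-MATH-RM-72B.

```python
def rerank_with_reward_model(question, candidates, score):
    """Best-of-N selection with a reward model (illustrative assumption).

    question:   the original problem statement.
    candidates: answer strings gathered across models by ModelSwitch.
    score:      hypothetical callable that returns a scalar reward for a
                (question, answer) pair, e.g. from Qwen2.5-MATH-RM-72B.
    """
    return max(candidates, key=lambda ans: score(question, ans))
```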

Further Thoughts

The ModelSwitch approach opens up intriguing avenues for optimizing test-time compute in LLMs, particularly by exploiting model diversity without requiring retraining or internal fusion. However, I am curious about its applicability in scenarios where model access costs vary significantly—switching between a cheap and an expensive model might negate efficiency gains if the latter is frequently invoked. This connects to broader research in model routing and mixture-of-experts systems, where dynamic selection based on query characteristics is key. Additionally, the reliance on consistency as a signal might be vulnerable to systematic biases in certain domains (e.g., consistently wrong answers in niche tasks), suggesting a need for hybrid signals combining consistency with external validation. I also see potential in integrating ModelSwitch with federated learning paradigms, where diverse models trained on distinct data distributions could enhance complementary strengths, especially in privacy-sensitive applications. Lastly, exploring failure modes more deeply—such as when all models in the set are inconsistent or confidently wrong—could lead to robust safeguards, perhaps by incorporating uncertainty quantification techniques from Bayesian deep learning. These directions could further solidify the practical impact of multi-LLM strategies in real-world deployments.


