arXiv:2505.03786

When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator

Published:  at  06:13 PM

This paper demonstrates that a 1.5B parameter reasoning model (Distill-R1) outperforms larger non-reasoning LLMs as a discriminator in a text-to-SQL planning framework by leveraging a novel soft score extraction method from chain-of-thought outputs, though it struggles significantly as a generator.

Large Language Model, Reasoning, Planning, Multi-Agent, Classification

Md Fahim Anjum

University of California San Francisco

Generated by grok-3

Background Problem

The paper addresses the integration of reasoning-capable Large Language Models (LLMs) into planning frameworks, focusing on their underexplored potential as discriminators relative to traditional non-reasoning LLMs. The key question is whether reasoning models, which leverage chain-of-thought (CoT) reasoning, can outperform larger non-reasoning models at evaluating candidate solutions within a generator-discriminator architecture, specifically for the text-to-SQL task. The work aims to fill the gap in systematic evaluation of reasoning models in agentic frameworks, hypothesizing that their structured reasoning could improve discrimination accuracy and overall planning performance despite much smaller model size.

Method

The core method deploys a distilled 1.5B-parameter reasoning model (Distill-R1) as the discriminator within a generator-discriminator LLM planning framework for text-to-SQL. The framework operates in three stages: a generator LLM produces candidate SQL queries, a discriminator LLM evaluates each candidate, and the candidates are re-ranked by score for final selection. A novel approach extracts soft scores from the reasoning model's CoT outputs: the model is prompted to emit its decision in JSON format (key 'correct' with value 'true' or 'false'), the logits of the decision tokens are read off, and a softmax over them yields a probability used to rank candidates. This contrasts with non-reasoning models, which use the token-level probability of a 'Yes' response directly. The method also tests variations such as naive discrimination (no additional information) and enhanced discrimination (with executability checks), alongside factors such as test-time compute budget and contextual augmentation with database schemas.
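The paper's exact prompt and parsing code are not reproduced here; the following is a minimal sketch, assuming a HuggingFace `transformers` interface and a DeepSeek-R1-style distilled 1.5B checkpoint (a hypothetical stand-in for Distill-R1), of how a soft score could be read from the logits at the decision token and used to rank candidates.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; the paper evaluates a distilled 1.5B R1-style reasoning model.
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def soft_score(question: str, candidate_sql: str, max_new_tokens: int = 1024) -> float:
    """Return P('true') for the JSON 'correct' field, used to rank a candidate SQL query."""
    prompt = (
        "Decide whether the SQL query answers the question.\n"
        f"Question: {question}\nSQL: {candidate_sql}\n"
        'After reasoning, answer with JSON: {"correct": true} or {"correct": false}.'
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,   # test-time compute budget for the CoT
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Decision-token ids; exact ids depend on the tokenizer and surrounding whitespace.
    true_id = tokenizer(" true", add_special_tokens=False).input_ids[-1]
    false_id = tokenizer(" false", add_special_tokens=False).input_ids[-1]
    gen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]
    for step, tok in enumerate(gen_ids.tolist()):
        if tok in (true_id, false_id):
            logits = out.scores[step][0]
            # Softmax over just the two decision tokens -> soft score in [0, 1].
            pair = torch.softmax(logits[[true_id, false_id]], dim=-1)
            return pair[0].item()
    return 0.0  # no parsable decision; rank the candidate last
```

Candidates can then be sorted by this score, with the top-ranked query taken as the final plan; this is the re-ranking step of the third stage.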

Experiment

The experiments were conducted on a 400-example subset of the Spider text-to-SQL dataset, sampled uniformly across difficulty levels for balance, though such a small subset may limit generalizability. The setup compares Distill-R1-1.5B against non-reasoning LLMs such as CodeLlama-7B and CodeLlama-13B, using intrinsic metrics (e.g., F1, Hit@1) for discrimination and end-to-end metrics (e.g., execution accuracy) for overall performance. Results show Distill-R1 achieving up to 87% higher F1 and 3.7% better discrimination accuracy than CodeLlama-7B, and 3.7% higher execution accuracy than CodeLlama-13B, which is impressive given the size disparity. However, increasing the test-time compute budget beyond 1024 tokens or adding schema context yielded diminishing returns (<0.4% gain), while very low budgets caused severe performance drops (<2% accuracy). As a generator, Distill-R1 underperformed even smaller non-reasoning models (e.g., 56.9% lower execution accuracy than TinyLlama-1.1B). The setup is reasonable under resource constraints, but it neither probes why the reasoning model's limits arise nor tests across diverse tasks, and the comparison may be biased if the non-reasoning models were not similarly optimized. Overall, the results partially confirm the hypothesis of superior discrimination but reveal unexpected generation weaknesses.
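For concreteness, here is a minimal sketch of the two metric families referenced above; the helper names and data fields are hypothetical and do not reproduce the paper's evaluation code. Hit@1 asks whether the discriminator's top-ranked candidate is correct; execution accuracy asks whether the finally selected SQL executes to the same result as the gold query.

```python
from typing import Callable, Sequence

def hit_at_1(candidates: Sequence[str], scores: Sequence[float],
             is_correct: Callable[[str], bool]) -> bool:
    """Intrinsic metric: is the candidate ranked first by the discriminator correct?"""
    top = max(range(len(candidates)), key=lambda i: scores[i])
    return is_correct(candidates[top])

def execution_accuracy(examples: Sequence[dict],
                       select_sql: Callable[[dict], str],
                       execute: Callable[[str, str], object]) -> float:
    """End-to-end metric: fraction of examples whose selected SQL matches the gold result.

    `examples` are dicts with 'db_id' and 'gold_sql' (hypothetical field names);
    `select_sql` runs the full generate -> discriminate -> re-rank pipeline;
    `execute` runs a query against the named database and returns its result set.
    """
    hits = sum(
        execute(select_sql(ex), ex["db_id"]) == execute(ex["gold_sql"], ex["db_id"])
        for ex in examples
    )
    return hits / len(examples)
```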

Further Thoughts

The findings of this paper open up intriguing avenues for rethinking the design of agentic frameworks, particularly the specialized roles of reasoning versus non-reasoning models. The superior discrimination performance of Distill-R1 suggests that reasoning models might be best utilized in evaluation-heavy tasks, potentially extending beyond text-to-SQL to domains like automated debugging or decision-making systems where nuanced judgment is critical. However, their poor generation performance raises questions about whether this is a fundamental limitation of reasoning architectures or a training artifact—could hybrid models combining reasoning for discrimination and non-reasoning for generation be more effective? Additionally, the diminishing returns with increased compute budget resonate with broader scaling law discussions in LLMs, hinting at potential inefficiencies in how reasoning is implemented or elicited. This connects to recent works on emergent abilities in LLMs, where beyond a certain scale or compute threshold, qualitative improvements plateau. Future research could explore fine-tuning reasoning models specifically for discrimination tasks or investigate mechanistic reasons for reasoning limits, perhaps by analyzing attention patterns in CoT outputs. Cross-domain testing is also crucial to validate if these insights hold in less structured tasks like natural language reasoning or creative problem-solving.


