REWARD-SQL introduces a Text-to-SQL framework that decomposes query generation into a Chain-of-CTEs and combines Process Reward Models (PRMs) with GRPO training and Best-of-N sampling, achieving a state-of-the-art 68.9% execution accuracy on the BIRD dataset with a 7B model.
Large Language Model, Reinforcement Learning, Reasoning, Process Supervision, Text-to-SQL
Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Jian Tan, Guoliang Li
Renmin University of China, HKUST (GZ), Alibaba Cloud Computing, Tsinghua University
Generated by grok-3
Background Problem
The paper addresses the Text-to-SQL task, which converts natural language questions into executable SQL statements, a capability critical for letting non-technical users interact with databases. Despite advances in large language models (LLMs), hallucinations in extended reasoning chains persist and often lead to incorrect SQL. The authors identify a gap in applying reward models, particularly Process Reward Models (PRMs), to provide fine-grained supervision over intermediate reasoning steps. The key problem addressed is designing a PRM tailored to Text-to-SQL that guides step-by-step reasoning without distorting it, alongside identifying the best strategies for integrating the PRM into training and inference to improve SQL generation accuracy.
Method
The REWARD-SQL framework introduces Chain-of-CTEs (COCTE), a reasoning paradigm that decomposes a complex SQL query into a sequence of Common Table Expressions (CTEs) for structured, interpretable step-by-step reasoning. The approach follows three stages: (1) a 'cold start' via supervised fine-tuning (SFT) on COCTE-formatted data to establish baseline reasoning capability; (2) training a PRM as a step-level binary classifier over CTE steps, with step labels derived via Monte Carlo estimation and training data diversified using syntax-tree edit distance; and (3) exploring four ways to integrate the PRM into policy optimization: two offline methods (Rejection Sampling, RS, and Direct Preference Optimization, DPO), one online method (Group Relative Policy Optimization, GRPO), and inference-time scaling via Best-of-N sampling, all driven by a combined reward of process quality (PR) and outcome correctness (OR). The core idea is to use the PRM for fine-grained feedback on intermediate reasoning steps while balancing training stability and inference quality; GRPO combined with Best-of-N emerges as the most effective configuration.
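To make the reward composition concrete, below is a minimal Python sketch of how a step-level PRM score (process reward, PR) and an execution-based outcome reward (OR) could be combined during policy optimization, and how Best-of-N selection might use the PRM at inference. The candidate representation (a list of CTE-step strings), the function names `prm_step_score` and `execute`, and the mixing weight `alpha` are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch (not the authors' implementation) of PR/OR reward composition and
# PRM-guided Best-of-N selection over Chain-of-CTE candidates.
from typing import Callable, List, Sequence

# Illustrative Chain-of-CTEs candidate: each element is one reasoning step
# (a CTE or the final SELECT), written against a hypothetical "students" schema.
EXAMPLE_COCTE = [
    "WITH ca_students AS (SELECT id, gpa FROM students WHERE state = 'CA')",
    "top_ca AS (SELECT id FROM ca_students ORDER BY gpa DESC LIMIT 10)",
    "SELECT COUNT(*) FROM top_ca",
]

def process_reward(cte_steps: Sequence[str],
                   prm_step_score: Callable[[str], float]) -> float:
    """Average PRM probability that each CTE step is correct (PR)."""
    return sum(prm_step_score(s) for s in cte_steps) / max(len(cte_steps), 1)

def outcome_reward(pred_sql: str, gold_sql: str,
                   execute: Callable[[str], list]) -> float:
    """1.0 if the predicted query returns the same rows as the gold query (OR)."""
    try:
        return 1.0 if execute(pred_sql) == execute(gold_sql) else 0.0
    except Exception:  # unexecutable SQL earns zero outcome reward
        return 0.0

def combined_reward(cte_steps: Sequence[str], pred_sql: str, gold_sql: str,
                    prm_step_score: Callable[[str], float],
                    execute: Callable[[str], list], alpha: float = 0.5) -> float:
    """Training-time reward mixing process quality (PR) and outcome correctness (OR)."""
    pr = process_reward(cte_steps, prm_step_score)
    orr = outcome_reward(pred_sql, gold_sql, execute)
    return alpha * pr + (1 - alpha) * orr

def best_of_n(candidates: List[Sequence[str]],
              prm_step_score: Callable[[str], float]) -> Sequence[str]:
    """Inference-time Best-of-N: keep the candidate whose steps the PRM rates highest."""
    return max(candidates, key=lambda steps: process_reward(steps, prm_step_score))
```

During GRPO training the gold query is available, so the combined PR plus OR signal can score each sampled rollout; at inference only the PRM score is used to pick among the N sampled chains.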
Experiment
Experiments were conducted on the BIRD dataset, a challenging Text-to-SQL benchmark with 9,428 training and 1,534 development pairs covering complex queries and databases. The setup reports execution accuracy (EX) under greedy decoding (single output) and vote@n (Best-of-N with PRM-based selection). Using Qwen2.5-Coder-7B-Instruct, REWARD-SQL reaches 68.9% EX on the BIRD dev set with GRPO plus Best-of-32, a 13.1% improvement over the SFT baseline (54.4%), outperforming comparable 7B models and even larger models such as GPT-4 in some settings. The experimental design is fairly comprehensive: comparing offline (RS, DPO) and online (GRPO) methods shows that GRPO's advantage stems from exploiting data in all four PR-OR quadrants, whereas RS underperforms because of limited exploration. Best-of-N substantially boosts performance (e.g., +12.6% for SFT), but generating 32 candidates per query is computationally expensive. Ablation studies in the appendix confirm the PRM's effectiveness over alternative reward designs, and zero-shot generalization to Spider is strong (81.7% EX), though these results may be dataset-specific. The gains are clear, yet the reliance on a single model and dataset raises questions about broader applicability, and the inference cost of Best-of-N may limit practical deployment.
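For concreteness, here is a minimal sketch of the evaluation protocol, assuming each BIRD database is available as a SQLite file: execution accuracy (EX) checks whether the predicted and gold queries return the same result set, and vote@n lets the PRM choose among n sampled candidates before scoring. The `prm_score` function and the order-insensitive result comparison are simplifying assumptions, not the official BIRD evaluation code.

```python
# Minimal sketch of execution accuracy (EX) and vote@n selection, assuming SQLite
# copies of the benchmark databases; not the official BIRD evaluation script.
import sqlite3
from typing import Callable, List, Optional

def run_query(db_path: str, sql: str) -> Optional[frozenset]:
    """Execute SQL and return its rows as an order-insensitive set, or None on error."""
    try:
        with sqlite3.connect(db_path) as conn:
            return frozenset(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """EX for one example: the predicted and gold queries return the same rows."""
    pred_rows = run_query(db_path, pred_sql)
    return pred_rows is not None and pred_rows == run_query(db_path, gold_sql)

def vote_at_n(db_path: str, candidates: List[str], gold_sql: str,
              prm_score: Callable[[str], float]) -> bool:
    """vote@n: the PRM picks one of n candidate queries, then EX is computed on it."""
    chosen = max(candidates, key=prm_score)
    return execution_match(db_path, chosen, gold_sql)
```

Aggregating `execution_match` over the development set corresponds to the greedy-decoding numbers, while aggregating `vote_at_n` corresponds to the Best-of-N (vote@n) results reported above.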
Further Thoughts
The REWARD-SQL framework's structured reasoning via Chain-of-CTEs offers a compelling approach to Text-to-SQL and could inspire similar decomposition strategies in other code generation tasks, such as general programming or query optimization in other languages. However, the risk of reward hacking in GRPO, where the model over-optimizes for PRM scores at the expense of genuine SQL correctness, warrants further investigation; could it be mitigated by hybrid reward designs that incorporate more rule-based checks? Additionally, the high computational cost of Best-of-N sampling (generating 32 candidates) suggests a need for efficiency-focused research, for instance adaptive sampling or lightweight PRMs. Relating this to broader AI trends, the step-wise supervision aligns with efforts in explainable AI, where intermediate steps enhance interpretability, as seen in recent work on mathematical reasoning with PRMs. A cross-domain study comparing Text-to-SQL PRMs with those used in mathematics or code generation could reveal shared challenges or transferable techniques, especially regarding generalization across diverse database schemas and problem types. Finally, exploring federated learning for PRM training could address privacy concerns in real-world database applications by ensuring sensitive schema data is never centralized during model development.