This paper investigates zero RL training on diverse open base models, achieving significant gains in accuracy and response length while identifying key factors, such as reward design and data difficulty, that influence the emergence of reasoning behaviors.
Reinforcement Learning, Large Language Model, Reasoning, Pre-training, Emergent Abilities
Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He
Hong Kong University of Science and Technology (HKUST), TikTok, Beijing University of Posts and Telecommunications (BUPT)
Generated by grok-3
Background Problem
The paper addresses the challenge of understanding and enhancing reasoning capabilities in large language models (LLMs) through zero reinforcement learning (RL) training, a paradigm where training starts directly from base (pretrained) models without supervised fine-tuning. Inspired by DeepSeek-R1’s demonstration of emergent long chain-of-thought (CoT) reasoning and self-reflection behaviors (termed the ‘aha moment’) via RL with rule-based rewards, this work investigates whether such phenomena can be replicated across diverse, smaller open base models that may lack initial instruction-following abilities. The key questions addressed include identifying critical factors for successful zero RL training, examining the emergence of cognitive behaviors across varied model families, and assessing whether increased response length reflects genuine reasoning improvements.
Method
The core method is zero RL training, which applies reinforcement learning directly to base models using the GRPO (Group Relative Policy Optimization) algorithm without prior supervised fine-tuning. It operates by sampling multiple responses per query and optimizing a policy with a token-level objective that balances reward maximization against a KL divergence penalty to prevent excessive deviation from the reference model. Key design strategies include: (1) using a simple rule-based reward function (+1 for correct answers, 0 for incorrect) to avoid format constraints that hinder exploration; (2) adjusting training data difficulty (categorized as Easy, Medium, Hard) to match the base model’s capabilities; and (3) monitoring training dynamics with metrics such as accuracy, response length, clip ratio, and reasoning behavior ratio (assessed via GPT-4o for behaviors such as verification and backtracking). The method was applied across 10 base models from different families and sizes, ensuring a broad evaluation scope.
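To make the reward and advantage computation concrete, here is a minimal Python sketch of a rule-based reward and a GRPO-style group-normalized advantage. This is not the authors' implementation: the answer-extraction helper, function names, and the simplistic last-number matching are illustrative assumptions only.

```python
import re
from typing import List, Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Rough stand-in for answer extraction: take the last number in the
    response. Real pipelines parse \\boxed{...} or other structured answers."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None


def rule_based_reward(response: str, ground_truth: str) -> float:
    """+1 for a correct final answer, 0 otherwise; no format reward is added,
    so exploration is not penalized for deviating from a fixed template."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each sampled response's reward by the
    mean and standard deviation over its group (all responses to one query)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Toy group: four sampled responses to a query whose answer is 42.
    sampled = ["... so the answer is 42", "... therefore 41", "I think 42.", "no idea"]
    rewards = [rule_based_reward(r, "42") for r in sampled]
    print(rewards, group_normalized_advantages(rewards))
```

In the full objective these per-response advantages are broadcast to every token of the corresponding response and combined with a clipped policy ratio and the KL penalty mentioned above.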
Experiment
Experiments were conducted on 10 base models (e.g., Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, and the Qwen2.5 series from 0.5B to 32B) using GSM8K and MATH datasets for training, split into Easy, Medium, and Hard difficulty levels (~8K problems each). Evaluation spanned multiple benchmarks (GSM8K, MATH500, Minerva Math, OlympiadBench, AIME24, AMC23) and generalization tests (IFEVAL, MMLU, GPQA-Diamond). The setup used consistent hyperparameters across models, with GRPO as the RL algorithm. Results showed significant accuracy improvements (e.g., DeepSeek-Math-7B's average accuracy nearly tripled, from 11.3% to 29.2%) and response-length increases in 9 of the 10 models, the exception being Qwen2.5-Math-7B due to its context-length limit. Pass@k accuracy improved by 10-30 points, suggesting genuine reasoning enhancement rather than mere reranking of existing capabilities. However, increased response length did not always correlate with cognitive behaviors such as verification, especially in Qwen2.5 models that already exhibit these behaviors. The experimental design was comprehensive in model diversity but limited by the small dataset and simple reward design, potentially missing more nuanced aspects of reasoning. The reliance on GPT-4o for behavior analysis introduces potential subjectivity, though it provides deeper insight than surface-level metrics.
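Pass@k results like those above are commonly computed with the unbiased estimator of Chen et al. (2021); whether the paper uses exactly this estimator is an assumption. A minimal sketch under that assumption:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn without replacement from n generations (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 16 samples per problem, 5 correct, estimate pass@8.
print(pass_at_k(n=16, c=5, k=8))  # ≈ 0.987
```

Because pass@k with large k reflects whether any sampled trajectory solves the problem, gains of 10-30 points at higher k support the claim that RL expands the set of solvable problems rather than merely reordering outputs the base model could already produce.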
Further Thoughts
The findings on zero RL training’s effectiveness across diverse base models open up intriguing avenues for future research, particularly in understanding how inherent model characteristics (e.g., pretraining data quality in Qwen2.5 vs. Llama) influence RL outcomes. This connects to broader discussions in AI research about the role of pretraining in shaping emergent abilities, as seen in works on scaling laws and foundation models. I wonder if integrating zero RL with parameter-efficient fine-tuning methods like LoRA could address exploration constraints without the performance ceiling imposed by traditional SFT, potentially offering a hybrid approach. Additionally, the subjectivity in reasoning behavior analysis via GPT-4o raises questions about developing more objective, automated metrics for cognitive behaviors—perhaps leveraging unsupervised clustering of response patterns. Finally, extending zero RL to multimodal domains or tasks beyond math reasoning (e.g., commonsense reasoning or code generation) could test its robustness and reveal whether the ‘aha moment’ is task-specific or a generalizable phenomenon, linking to ongoing efforts in AI for Science and multimodal systems.