This paper introduces MiMo-7B, a 7B-parameter LLM optimized for reasoning through pre-training on reasoning-dense data with a multi-token-prediction objective and post-training with RL using test-difficulty-driven rewards, achieving performance that surpasses larger models and OpenAI o1-mini on mathematics and coding benchmarks.
Large Language Model, Reinforcement Learning, Pre-training, Reasoning, Efficiency
Core Team, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
Xiaomi LLM-Core Team
Generated by grok-3
Background Problem
The research addresses the challenge of enhancing reasoning capabilities in large language models (LLMs), particularly in smaller models around 7B parameters, where high performance on complex tasks such as mathematical reasoning and code generation is considered hard to achieve compared to larger models (e.g., 32B). Current successful RL recipes typically rely on larger base models, and simultaneously improving mathematical and coding skills in smaller models remains an open gap. The key problem addressed is unlocking the inherent reasoning potential of smaller LLMs through optimized pre-training and post-training, with the goal of outperforming larger models and state-of-the-art reasoning models such as OpenAI o1-mini.
Method
The methodology for MiMo-7B is divided into pre-training and post-training phases:
- Pre-Training: Focuses on building a base model (MiMo-7B-Base) with strong reasoning potential by optimizing data preprocessing to increase the density of reasoning patterns (e.g., improved HTML extraction for math and code content), generating synthetic reasoning data, and applying a three-stage data mixture strategy over 25 trillion tokens. It also adds Multi-Token Prediction (MTP) as an auxiliary training objective, using a single MTP layer during training and multiple MTP layers for speculative decoding at inference, improving both quality and inference speed (a minimal MTP loss sketch follows this list).
- Post-Training: Involves Supervised Fine-Tuning (SFT) on a curated dataset of 500K samples and Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) on 130K verifiable mathematics and code problems. The RL innovations include a test-difficulty-driven reward that mitigates sparse rewards on code problems by assigning fine-grained scores based on test-case difficulty, and an easy-data re-sampling strategy that stabilizes training by keeping a pool of already-solved problems for occasional sampling (reward and re-sampling sketches follow this list). In addition, a Seamless Rollout Engine reduces GPU idle time during RL training through continuous rollout, asynchronous reward computation, and early termination.
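To make the MTP objective concrete, here is a minimal PyTorch sketch of one extra prediction head trained jointly with the standard next-token loss. The +2 offset, the head architecture, and the 0.3 loss weight are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHead(nn.Module):
    """A single multi-token-prediction head that predicts the token at
    offset +2 from the backbone hidden states (illustrative layout; the
    paper's exact MTP architecture may differ)."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.lm_head(torch.tanh(self.proj(hidden_states)))


def training_loss(hidden, main_logits, input_ids, mtp_head, mtp_weight=0.3):
    """Standard next-token loss plus one MTP loss term (weight is an assumption)."""
    vocab = main_logits.size(-1)
    # Next-token prediction: position t predicts token t+1.
    ntp_loss = F.cross_entropy(
        main_logits[:, :-1].reshape(-1, vocab), input_ids[:, 1:].reshape(-1)
    )
    # MTP: position t additionally predicts token t+2 via the extra head.
    mtp_logits = mtp_head(hidden)
    mtp_loss = F.cross_entropy(
        mtp_logits[:, :-2].reshape(-1, vocab), input_ids[:, 2:].reshape(-1)
    )
    return ntp_loss + mtp_weight * mtp_loss
```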
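The test-difficulty-driven reward and GRPO's group-relative advantages can be sketched as follows; the difficulty groups, weights, and all-or-nothing credit per group are assumptions chosen to illustrate IOI-style partial scoring, not the paper's exact scheme.

```python
import numpy as np

def difficulty_weighted_reward(passed, test_group, group_weight):
    """Partial credit for code problems: test cases are clustered into
    difficulty groups and each fully-passed group adds its weight, instead
    of a single all-or-nothing pass signal (groups/weights are illustrative)."""
    score = 0.0
    for group, weight in group_weight.items():
        idx = [i for i, g in enumerate(test_group) if g == group]
        if idx and all(passed[i] for i in idx):
            score += weight
    return score / sum(group_weight.values())

def grpo_advantages(rewards):
    """Group-relative advantage: normalize each rollout's reward by the
    mean and standard deviation of its sampled group, as in GRPO."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

# Example: the easy and medium groups fully pass, the hard group fails -> reward 0.5
passed = [True, True, False, True]
groups = ["easy", "easy", "hard", "medium"]
weights = {"easy": 1.0, "medium": 2.0, "hard": 3.0}
print(difficulty_weighted_reward(passed, groups, weights))   # 0.5
print(grpo_advantages([0.5, 1.0, 0.0, 0.5]))
```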
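A minimal sketch of the easy-data re-sampling idea, assuming a small fixed probability (here 10%, an invented value) of drawing a prompt from the pool of already-solved problems:

```python
import random

def sample_prompts(main_pool, easy_pool, batch_size, easy_prob=0.1):
    """Fill most of the batch from the main problem pool, but occasionally
    draw from a pool of problems the policy already solves so that rewards
    do not become too sparse late in training (probability is an assumption)."""
    batch = []
    for _ in range(batch_size):
        if easy_pool and random.random() < easy_prob:
            batch.append(random.choice(easy_pool))
        else:
            batch.append(random.choice(main_pool))
    return batch
```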
Experiment
The experiments are structured to evaluate both the base and RL-tuned models across multiple benchmarks:
- Setup: MiMo-7B-Base is evaluated on general reasoning (e.g., BBH, MMLU), mathematics (e.g., AIME, GSM8K), coding (e.g., LiveCodeBench, HumanEval), and long-context comprehension (e.g., RULER) against comparable 7B-9B models such as Llama-3.1-8B and Qwen2.5-7B. MiMo-7B-RL is evaluated against stronger baselines, including OpenAI o1-mini, on similar tasks with a sampling temperature of 0.6 and a maximum generation length of 32,768 tokens.
- Results: MiMo-7B-Base shows superior reasoning potential, scoring 75.2 on BBH and 32.9 on AIME 2024, significantly outperforming similar-sized models and even some 32B models on pass@k metrics (a pass@k estimator sketch follows this list). MiMo-7B-RL achieves top-tier results, with 55.4 on AIME 2025 (exceeding o1-mini by 4.7 points) and 57.8 on LiveCodeBench v5 (outperforming o1-mini). However, general performance on non-reasoning tasks remains competitive but not leading.
- Analysis: The experimental setup is comprehensive for reasoning tasks but offers little diversity in non-reasoning benchmarks, raising concerns about overfitting to the math and code domains. The gains are clearest in the targeted areas, yet the absence of ablation studies on individual method components (e.g., the contribution of each pre-training stage or of the reward scheme) limits understanding of their specific contributions. The results match the expectation of enhanced reasoning but need validation on broader tasks to confirm generalizability.
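Because the base-model comparison leans on pass@k, the sketch below shows the standard unbiased pass@k estimator (Chen et al., 2021); whether the paper uses exactly this estimator is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (c of which are correct) solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 generations per problem, 8 correct -> estimated pass@8
print(round(pass_at_k(n=32, c=8, k=8), 3))
```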
Further Thoughts
The approach of MiMo-7B to enhancing reasoning in smaller LLMs through data-centric pre-training and RL innovations opens up intriguing possibilities for scaling down model size without sacrificing performance, which could democratize access to powerful AI tools in resource-constrained environments. However, the heavy focus on mathematics and code reasoning raises questions about potential trade-offs in other domains such as natural language understanding or creative tasks: could this specialization limit the model's applicability in broader AI applications? Additionally, the test-difficulty-driven reward system, inspired by human competition scoring such as the IOI, suggests a novel intersection between educational assessment frameworks and AI training paradigms; exploring this further could lead to more human-aligned reward mechanisms in RL. Comparing this with RLHF (Reinforcement Learning from Human Feedback) approaches, such as those used in models like GPT-4o, might reveal whether rule-based rewards are inherently more robust against reward hacking than human preference-based systems. Lastly, the Seamless Rollout Engine's efficiency gains hint at a broader need for system-level optimizations in RL training pipelines: could similar principles be applied to other computationally intensive training scenarios, such as robotics or multimodal systems?