arXiv:2505.07608

MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

Published at 11:04 AM

This paper introduces MiMo-7B, a 7B-parameter LLM optimized for reasoning via pre-training on reasoning-dense data with a multi-token prediction objective and post-training with RL using test-difficulty-driven rewards; the resulting model outperforms larger models and OpenAI o1-mini on mathematics and coding benchmarks.

Large Language Model, Reinforcement Learning, Pre-training, Reasoning, Efficiency

Core Team, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue

Xiaomi LLM-Core Team

Generated by grok-3

Background Problem

The research addresses the challenge of enhancing reasoning capabilities in large language models (LLMs), particularly in smaller models around 7B parameters, where strong performance on complex tasks such as mathematical reasoning and code generation is considered difficult to achieve compared to larger models (e.g., 32B). Successful RL approaches to date typically start from larger base models, and simultaneously improving mathematical and coding skills in smaller models remains an open gap. The key problem addressed is unlocking the inherent reasoning potential of smaller LLMs through optimized pre-training and post-training strategies, with the goal of outperforming both larger models and state-of-the-art reasoning models such as OpenAI o1-mini.

Method

The methodology for MiMo-7B is divided into pre-training and post-training phases: pre-training centers on a reasoning-dense data mixture and adds a multi-token prediction (MTP) objective, while post-training applies reinforcement learning with test-difficulty-driven rewards, supported by a Seamless Rollout Engine for training efficiency. An illustrative sketch of the difficulty-weighted reward idea follows.
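The paper's exact reward formulation is not reproduced in this summary, but the intuition behind test-difficulty-driven rewards can be sketched as IOI-style partial credit: test cases are bucketed into groups, harder groups carry more weight, and a rollout earns the weight of every group it fully passes. The grouping, weights, and pass-checking below are illustrative assumptions, not the paper's implementation.

```python
"""Illustrative sketch of a test-difficulty-driven reward for code RL.

Assumption: test cases are grouped by difficulty (IOI-style subtasks) and
harder groups carry larger weights, so a rollout that only solves the easy
tests still receives partial credit.
"""

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TestGroup:
    name: str                           # e.g. "easy", "medium", "hard"
    weight: float                       # harder groups get larger weights (assumed)
    cases: List[Tuple[object, object]]  # (input, expected_output) pairs


def difficulty_driven_reward(
    candidate: Callable[[object], object],
    groups: List[TestGroup],
) -> float:
    """Return a reward in [0, 1]: the difficulty-weighted fraction of
    test groups the candidate solution passes in full."""
    total_weight = sum(g.weight for g in groups)
    earned = 0.0
    for group in groups:
        # A group counts only if every test case in it passes (subtask scoring).
        if all(candidate(x) == y for x, y in group.cases):
            earned += group.weight
    return earned / total_weight


if __name__ == "__main__":
    # Toy task: return the square of the input.
    groups = [
        TestGroup("easy", weight=1.0, cases=[(2, 4), (3, 9)]),
        TestGroup("hard", weight=3.0, cases=[(10**6, 10**12)]),
    ]
    partial_solution = lambda x: x * x if x < 100 else 0  # fails the hard group
    print(difficulty_driven_reward(partial_solution, groups))  # 0.25
```

Compared with an all-or-nothing pass/fail signal, this kind of partial credit keeps the reward dense enough for RL to make progress on hard problems, which is the usual motivation for difficulty-aware scoring.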

Experiment

The experiments evaluate both the base and RL-tuned models across multiple benchmarks, focusing on mathematics and code reasoning and comparing against larger models as well as OpenAI o1-mini.

Further Thoughts

The approach of MiMo-7B to enhancing reasoning in smaller LLMs through data-centric pre-training and RL innovations opens up intriguing possibilities for scaling down model size without sacrificing performance, which could democratize access to powerful AI tools in resource-constrained environments. However, the heavy focus on mathematics and code reasoning raises questions about potential trade-offs in other domains like natural language understanding or creative tasks: could this specialization limit the model's applicability in broader AI applications?

Additionally, the test-difficulty-driven reward system, inspired by human competition scoring such as the IOI, suggests a novel intersection between educational assessment frameworks and AI training paradigms; exploring this further could lead to more human-aligned reward mechanisms in RL for AI.

Comparing this with RLHF (Reinforcement Learning from Human Feedback) approaches, such as those used in models like GPT-4o, might reveal whether rule-based rewards are inherently more robust against reward hacking than human preference-based systems.

Lastly, the Seamless Rollout Engine's efficiency gains hint at a broader need for system-level optimizations in RL training pipelines: could similar principles be applied to other computationally intensive training scenarios, such as robotics or multimodal systems?
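As a concrete illustration of what "rule-based" means in that contrast: for verifiable domains like math, the reward can be a deterministic check against a reference answer rather than a learned preference model. The answer-extraction convention and normalization below are assumptions made for the sketch, not details taken from the paper.

```python
# Minimal sketch of a rule-based reward for verifiable math answers.
# Assumption: the reward is 1.0 only when the model's final boxed answer
# matches the reference exactly after light normalization; everything else
# scores 0.0, so there is no learned reward model to exploit.
import re


def extract_final_answer(completion: str) -> str:
    """Take the last \\boxed{...} expression, a common convention on math benchmarks."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else ""


def rule_based_reward(completion: str, reference: str) -> float:
    def normalize(s: str) -> str:
        return s.replace(" ", "").lower()

    return 1.0 if normalize(extract_final_answer(completion)) == normalize(reference) else 0.0


print(rule_based_reward("... so the result is \\boxed{42}.", "42"))  # 1.0
print(rule_based_reward("... the answer might be 41.", "42"))        # 0.0
```

Because nothing in this reward is learned, there is no proxy model whose blind spots the policy can exploit, which is the usual argument for the robustness of rule-based rewards against reward hacking.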


