arXiv: 2411.14432

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models


Insight-V introduces a scalable data generation pipeline and a multi-agent system with iterative DPO training to significantly enhance long-chain visual reasoning in MLLMs, achieving up to 7.0% performance gains on challenging benchmarks while maintaining perception capabilities.

Multimodal Systems, Reasoning, Large Language Model, Data Augmentation, Reinforcement Learning, Human-AI Interaction

Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu

S-Lab, Nanyang Technological University, Tencent, Tsinghua University, Nanjing University

Generated by grok-3

Background Problem

The paper addresses the challenge of enabling Multimodal Large Language Models (MLLMs) to perform human-level long-chain visual reasoning, a critical gap in current AI capabilities for handling complex multimodal data. Despite advancements in MLLMs for tasks like visual understanding and question-answering, the lack of large-scale, high-quality reasoning datasets and effective training pipelines hinders progress in detailed step-by-step visual reasoning. Existing methods, often adapted from text-based Chain-of-Thought (CoT) approaches, show limited effectiveness due to insufficient structured data and the difficulty of balancing reasoning with visual perception in a single model. Insight-V aims to solve these issues by introducing a scalable data generation pipeline for long-chain reasoning data and a multi-agent system to decompose reasoning tasks, thereby enhancing MLLM performance on complex visual reasoning benchmarks.

Method

Insight-V proposes a novel framework to enhance long-chain visual reasoning in MLLMs through three main components:

  1. Scalable Data Generation Pipeline: A progressive strategy generates structured, long-chain reasoning data with a reasoning generator that iteratively emits reasoning steps in JSON format, each tagged with an action (‘continue’ or ‘summary’). A multi-granularity assessment system, built on strong models such as Qwen2 and Qwen2-VL, then filters and scores the reasoning paths for correctness and detail, yielding high-quality data without human annotation (see the generation sketch after this list).
  2. Multi-Agent System: Problem-solving is decomposed into two distinct roles: a reasoning agent generates a detailed step-by-step reasoning process, while a summary agent evaluates that process and selectively uses it to answer the query (see the inference sketch below). This separation aims to mitigate errors from flawed reasoning and improve robustness.
  3. Two-Stage Training Pipeline: Supervised fine-tuning (SFT) first trains both agents on curated datasets (200K samples for reasoning, 1.2M for summary); iterative Direct Preference Optimization (DPO) is then applied to the reasoning agent to align its outputs with human preferences, improving reasoning quality over multiple rounds.

The core idea is to mimic human-like reasoning by breaking complex tasks into specialized sub-tasks, supported by high-quality structured data, achieving better visual reasoning while preserving perception capabilities.
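To make the data-generation loop concrete, here is a minimal Python sketch of the progressive strategy and the downstream filtering, under stated assumptions: `generate_step` is a hypothetical stand-in for the reasoning generator, `judge` for the Qwen2-based assessment model, and the JSON step schema shown in the comments is illustrative, not the paper's exact format.

```python
import json
from typing import Callable

def generate_reasoning_path(generate_step: Callable[[str], str],
                            question: str, max_steps: int = 10) -> list:
    """Progressively query the reasoning generator, accumulating structured
    JSON steps until it emits a 'summary' action."""
    steps = []
    for _ in range(max_steps):
        prompt = question + "\n" + json.dumps(steps)
        # Each response is assumed to be a JSON object such as
        # {"step": 2, "action": "continue", "thought": "..."} (schema is illustrative).
        step = json.loads(generate_step(prompt))
        steps.append(step)
        if step.get("action") == "summary":
            break
    return steps

def filter_paths(paths: list, judge: Callable[[list], dict],
                 min_detail_score: int = 3) -> list:
    """Multi-granularity assessment: keep only paths whose final answer the
    judge marks correct and whose reasoning detail scores high enough."""
    kept = []
    for path in paths:
        verdict = judge(path)  # assumed shape: {"correct": bool, "detail_score": int}
        if verdict["correct"] and verdict["detail_score"] >= min_detail_score:
            kept.append(path)
    return kept
```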

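As a rough illustration of the two-role decomposition at inference time, the sketch below wires the agents together; `reasoning_agent` and `summary_agent` are hypothetical callables standing in for the two fine-tuned MLLMs, and the toy lambdas exist only to make the snippet runnable.

```python
from typing import Callable

def multi_agent_answer(reasoning_agent: Callable[[str, str], str],
                       summary_agent: Callable[[str, str, str], str],
                       image_path: str, question: str) -> str:
    """Two-role decomposition: the reasoning agent drafts a detailed chain of
    thought; the summary agent sees the question plus the draft and can use it
    selectively, so a flawed chain can be discounted rather than trusted."""
    reasoning = reasoning_agent(image_path, question)
    return summary_agent(image_path, question, reasoning)

# Toy stand-ins for the two fine-tuned MLLMs (illustrative only).
draft = lambda img, q: "Step 1: locate the axes. Step 2: compare bar heights."
summarize = lambda img, q, r: "The 2023 bar is taller."
print(multi_agent_answer(draft, summarize, "chart.png", "Which bar is taller?"))
```

The key design choice this captures is that the summary agent is trained to answer conditioned on, but not bound to, the reasoning draft, which is where the robustness to flawed chains comes from.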
Experiment

The experiments evaluate Insight-V on multiple visual reasoning benchmarks (e.g., MMMU, MMBench, ChartQA, MathVista) and on general multimodal perception tasks (e.g., TextVQA, DocVQA). The setup integrates Insight-V with LLaVA-NeXT (8B) and with a custom 7B base model built on Qwen2.5, training the reasoning and summary agents on 200K and 1.2M samples, respectively, and applying iterative DPO over three rounds (a sketch of the per-round objective appears below). Results show significant improvements: a 7.0% average gain with LLaVA-NeXT and 2.9% with the base model across seven reasoning benchmarks, including gains of 9.1% on MME and 5.8% on ChartQA, indicating strong reasoning enhancement. On perception tasks, performance is maintained or slightly improved (e.g., up to 4.1% on OCRBench for LLaVA-NeXT), suggesting no compromise on basic visual understanding. Ablation studies confirm the multi-agent system's advantage over single-model CoT and multi-turn approaches, and show that scaling the reasoning data improves quality. However, the experimental setup, while comprehensive, may lack diversity in real-world scenarios, and the reliance on specific benchmarks raises questions about generalizability. Iterative DPO shows incremental gains (roughly 0.6% per round), but risks of overfitting or diminishing returns are not addressed. Overall, the results match the expectation of improved reasoning, though computational overhead and scalability remain concerns.
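For readers unfamiliar with the preference-alignment step, below is a minimal PyTorch sketch of the standard DPO objective applied in each round; the per-sequence log-probabilities would come from the reasoning agent (the policy) and a frozen copy of it (the reference), and all tensor values and names here are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the chosen reasoning
    path over the rejected one, relative to a frozen reference model, with
    beta controlling the strength of the implicit KL penalty."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of summed sequence log-probabilities (illustrative values).
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```

In an iterative scheme like the one described here, the reference model is typically refreshed with the latest policy after each round and fresh preference pairs are sampled, which is consistent with the incremental per-round gains reported above.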

Further Thoughts

The multi-agent approach in Insight-V, decomposing reasoning and summarization, opens up fascinating avenues for AI system design beyond MLLMs, potentially applicable to domains like robotics or multi-agent reinforcement learning where task decomposition could enhance performance. However, the heavy reliance on external advanced models for data assessment (e.g., Qwen2-VL) raises concerns about bias propagation and scalability—could a self-assessment mechanism within the system reduce this dependency? Additionally, the computational cost of training two full-scale agents is a significant barrier; exploring asymmetric architectures (e.g., a lightweight summary agent) could align with trends in efficient AI systems, as seen in recent works on parameter-efficient fine-tuning. The iterative DPO’s effectiveness also prompts questions about its limits—how many iterations are optimal before performance plateaus or overfitting occurs? This could tie into broader research on reinforcement learning stability in large models. Finally, the robustness to flawed reasoning paths, while promising, needs testing under adversarial conditions or with noisier real-world data, connecting to ongoing discussions in trustworthy AI about model reliability under uncertainty. These aspects warrant deeper exploration to refine and generalize Insight-V’s contributions.


