arXiv: 2412.14803

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations


The Video Prediction Policy (VPP) introduces a generalist robot policy that conditions action learning on predictive visual representations from a fine-tuned video diffusion model, learning an implicit inverse dynamics model and improving over state-of-the-art baselines by 41.5% in average task completion length on the Calvin ABC→D benchmark and by 31.6% in success rate on real-world dexterous manipulation tasks.

Reinforcement Learning, Generative AI, Diffusion Model, Robotics, Prediction, Multimodal Systems

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen

IIIS, Tsinghua University, Shanghai AI Lab, Shanghai Qi Zhi Institute, RobotEra, University of California, Berkeley

Generated by grok-3

Background Problem

The development of generalist robot policies capable of handling diverse tasks is a critical challenge in robotics. A key component is the vision encoder, which processes pixel observations into actionable representations. Traditional vision encoders, often pre-trained with single-image reconstruction or two-image contrastive learning, focus on static information and fail to capture the dynamic aspects essential for embodied tasks. Recent advancements in video diffusion models (VDMs) have shown strong capabilities in predicting future frames, suggesting an inherent understanding of physical dynamics. This paper addresses the problem of leveraging these predictive capabilities to enhance robotic policies by hypothesizing that VDMs can provide visual representations encompassing both current and future states, thus offering valuable guidance for action learning in robotics.

Method

The Video Prediction Policy (VPP) employs a two-stage learning process to develop a generalist robot policy using predictive visual representations from video diffusion models (VDMs). In the first stage, a pre-trained video foundation model (Stable Video Diffusion, 1.5 billion parameters) is fine-tuned into a Text-guided Video Prediction (TVP) model on internet human manipulation data, robot manipulation data, and task-specific datasets, incorporating language features via CLIP embeddings and adjusting video resolution for efficiency. The fine-tuning optimizes a diffusion objective to predict future video sequences conditioned on initial frames and instructions.

In the second stage, the TVP model serves as a vision encoder: a single forward pass (avoiding time-consuming iterative denoising) extracts predictive representations from its up-sampling layers, which are aggregated across layers and camera views (e.g., static and wrist cameras) using interpolation and concatenation. A Video Former module compresses these high-dimensional representations into a fixed number of tokens via spatial-temporal attention, and a diffusion policy head generates action sequences conditioned on these tokens, learning an implicit inverse dynamics model that tracks robot movements in the predicted futures. Keeping the encoder to a single forward pass minimizes computational overhead and enables high-frequency closed-loop control.
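
To make the second stage concrete, below is a minimal PyTorch sketch of the pipeline described above: one forward pass through the fine-tuned video diffusion model yields up-sampling-layer features, a Video Former compresses them into a fixed set of tokens, and a diffusion-style policy head is conditioned on those tokens. All module names, shapes, and hyperparameters are illustrative assumptions rather than the authors' implementation; the TVP denoiser is abstracted as a callable returning a list of feature maps.

```python
# Minimal sketch (PyTorch) of VPP's second stage as described above.
# Module names, shapes, and hyperparameters are illustrative stand-ins, not
# the authors' implementation; the fine-tuned TVP denoiser is abstracted as a
# callable that returns a list of up-sampling-layer feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F


def extract_predictive_features(tvp_denoiser, frames, text_emb):
    """Single forward pass through the video diffusion model (no iterative
    denoising); returns per-pixel features flattened over time and space."""
    with torch.no_grad():
        up_feats = tvp_denoiser(frames, text_emb)   # list of (B, C_i, T, H_i, W_i)
    target = up_feats[-1].shape[-2:]
    resized = [
        F.interpolate(f.flatten(1, 2), size=target, mode="bilinear")
        .unflatten(1, (f.shape[1], f.shape[2]))
        for f in up_feats
    ]
    feats = torch.cat(resized, dim=1)               # (B, sum C_i, T, H, W)
    B, C, T, H, W = feats.shape
    return feats.permute(0, 2, 3, 4, 1).reshape(B, T * H * W, C)


class VideoFormer(nn.Module):
    """Compress high-dimensional predictive features into a fixed number of
    tokens via learned queries and (spatial-temporal) cross-attention."""

    def __init__(self, feat_dim, n_tokens=16, d_model=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.queries = nn.Parameter(torch.randn(n_tokens, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, feats):                       # (B, T*H*W, feat_dim)
        memory = self.proj(feats)
        queries = self.queries.expand(feats.size(0), -1, -1)
        return self.decoder(queries, memory)        # (B, n_tokens, d_model)


class DiffusionPolicyHead(nn.Module):
    """Toy conditional denoiser over an action chunk; the real head is a full
    diffusion policy trained with a DDPM-style objective."""

    def __init__(self, d_model, action_dim=7, horizon=10):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(d_model + horizon * action_dim + 1, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )

    def forward(self, tokens, noisy_actions, t):
        cond = tokens.mean(dim=1)                   # pool policy tokens
        x = torch.cat([cond, noisy_actions.flatten(1), t[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)


if __name__ == "__main__":
    B, T = 2, 4
    # Dummy stand-in for the TVP model's up-sampling activations.
    dummy_denoiser = lambda frames, txt: [
        torch.randn(B, 64, T, 16, 16),
        torch.randn(B, 32, T, 32, 32),
    ]
    feats = extract_predictive_features(dummy_denoiser, None, None)
    tokens = VideoFormer(feat_dim=96)(feats)                     # (2, 16, 256)
    eps = DiffusionPolicyHead(256)(tokens, torch.randn(B, 10, 7), torch.rand(B))
    print(tokens.shape, eps.shape)                   # predicted action noise
```

The design choice worth noting is that the encoder is queried once per control step rather than run through a full denoising chain, which is what keeps the representation predictive yet cheap enough for closed-loop control.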

Experiment

The experiments evaluate VPP across simulated and real-world robotic tasks to assess the effectiveness of its predictive visual representations. In simulation, VPP was tested on the Calvin ABC→D benchmark (a long-horizon instruction-following task evaluated in the unseen environment D) and the MetaWorld benchmark (50 diverse manipulation tasks with a Sawyer robot). Training data included internet human and robot manipulation videos (over 370,000 trajectories) alongside task-specific data, with fine-tuning and policy training conducted on NVIDIA A100 GPUs. Baselines included state-of-the-art methods such as GR-1, Susie, and Diffusion Policy.

VPP achieved a 41.5% relative improvement in average task completion length on Calvin ABC→D (4.33 vs. 3.06 for GR-1) and a 10.8-point higher average success rate on MetaWorld (0.682 vs. 0.574 for GR-1). Ablation studies confirmed the importance of predictive representations (average length dropped to 2.58 when the encoder was replaced with a Stable-VAE), of internet-scale pre-training (1.63 without SVD pre-training), and of architectural components such as the Video Former (3.86 without it). Real-world tests on a Franka Panda arm and an Xarm with a dexterous hand, covering seen, unseen, and tool-use tasks (over 700 rollouts), showed a 31.6% average improvement in success rate over GR-1, with strong generalization to unseen tasks.

The setup appears comprehensive, covering diverse tasks and environments, and the results align with the expectation that predictive dynamics help. However, the definition of 'unseen tasks' lacks rigor, and the 140 ms per-step latency may still limit applicability in highly dynamic scenarios. Visualizations suggest that single-step predictions capture useful dynamics, but texture inaccuracies could affect precision in complex tasks.
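
As a side note on the Calvin numbers quoted above (4.33 vs. 3.06), "average task completion length" is computed over chains of five language instructions. The short sketch below shows how such a figure is obtained under the standard Calvin evaluation protocol, where an episode counts the consecutive sub-tasks completed before the first failure; the example data are made up purely for illustration.

```python
# Sketch of the Calvin "average task completion length" metric quoted above,
# assuming the standard protocol: each evaluation episode chains 5 language
# instructions, and an episode's length is the number of consecutive sub-tasks
# completed before the first failure. Example data are illustrative only.
from typing import List


def episode_length(subtask_success: List[bool]) -> int:
    """Consecutive successes from the start of a 5-instruction chain."""
    length = 0
    for ok in subtask_success:
        if not ok:
            break
        length += 1
    return length


def average_completion_length(episodes: List[List[bool]]) -> float:
    """Mean chain length over all evaluation episodes (maximum is 5.0)."""
    return sum(episode_length(e) for e in episodes) / len(episodes)


# Two toy episodes: one completes 4/5 sub-tasks, the other all 5.
print(average_completion_length([[True, True, True, True, False],
                                 [True, True, True, True, True]]))   # 4.5
```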

Further Thoughts

The VPP approach opens up fascinating avenues for integrating generative AI with robotics, particularly through the use of video diffusion models to predict future states. However, a deeper exploration of failure modes is warranted—under what conditions do predictive representations fail to capture critical dynamics, especially in highly stochastic or rapidly changing environments like autonomous driving or disaster response robotics? Additionally, the reliance on internet-scale pre-training raises questions about data bias and ethical implications; for instance, if the training data over-represents certain types of manipulation tasks or environments, could this limit generalization in underrepresented scenarios? Connecting this to broader AI research, VPP’s methodology could intersect with advancements in multimodal foundation models, where combining video prediction with audio or tactile data might further enhance robotic perception and decision-making. I’m also intrigued by potential parallels with reinforcement learning in game environments, where predictive models of future states (e.g., in Monte Carlo Tree Search) have driven breakthroughs—could VPP’s predictive representations be adapted to such domains for planning under uncertainty? Finally, the computational cost, though mitigated, remains a concern; exploring parameter-efficient fine-tuning techniques or smaller proxy models for real-time inference could be a valuable next step to democratize this approach for resource-constrained robotic systems.


