arXiv: 2505.03181

VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making


This paper introduces VLM Q-Learning, an offline-to-online reinforcement learning method that fine-tunes Vision-Language Models for interactive decision-making by filtering suboptimal actions with a critic head, achieving significant performance improvements over supervised fine-tuning across multiple multimodal agent tasks.

Reinforcement Learning, Vision Foundation Model, Multimodality, Agent, Fine-Tuning, Decision Making

Jake Grigsby, Yuke Zhu, Michael Ryoo, Juan Carlos Niebles

The University of Texas at Austin, Salesforce AI Research

Generated by grok-3

Background Problem

Vision-Language Models (VLMs) extend Large Language Models (LLMs) to handle multimodal inputs, enabling applications in interactive decision-making environments such as computer automation. However, open-weight VLMs lag behind LLMs in critical agent skills, such as adhering to strict action syntax and managing long-context interactions, and, because of their general-purpose pre-training, often fail to align with specific task objectives. This work addresses the challenge of fine-tuning VLMs for agent tasks by leveraging reinforcement learning (RL) to improve decision-making beyond the limits of supervised fine-tuning (SFT), which cannot outperform its training data and struggles with suboptimal or noisy datasets.

Method

The proposed method, VLM Q-Learning (VLMQ), adapts VLMs for agent tasks using an offline-to-online RL framework with Advantage-Filtered Supervised Fine-Tuning (AFSFT). The core idea is to treat the VLM as an RL policy that maps text and image observations to text actions and to fine-tune it to maximize environmental reward. It introduces a dual-head architecture: a language head (actor) for token generation and a critic head that estimates future returns (Q-values) and filters suboptimal actions via an advantage threshold. The key steps, sketched in code below, are:

1) Convert turn-based VLM-environment interactions into token-based RL transitions for granular optimization.
2) Apply a filtered SFT loss, guided by the critic's advantage estimates, so the policy avoids imitating poor decisions in the dataset.
3) Train both heads simultaneously with a joint loss balancing the actor and critic objectives, supported by techniques like LoRA for parameter efficiency.

This method aims to replace traditional SFT by allowing self-improvement over suboptimal datasets while remaining compatible with standard prompting strategies such as chain-of-thought reasoning.
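To make the filtering step concrete, here is a minimal PyTorch sketch of an advantage-filtered SFT update with a joint critic loss. The dual-head outputs (`actor_logits` from the language head, `critic_q` from the critic head), the precomputed TD targets, and the `adv_threshold` and `critic_coef` values are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def afsft_loss(actor_logits, critic_q, target_tokens, td_targets,
               adv_threshold=0.0, critic_coef=0.5):
    """Joint actor/critic loss for one batch of token-level transitions.

    actor_logits:  (B, T, V) logits from the VLM's language head (actor)
    critic_q:      (B, T, V) per-token Q-value estimates from the critic head
    target_tokens: (B, T)    action tokens taken in the (possibly suboptimal) dataset
    td_targets:    (B, T)    precomputed Q-learning targets, e.g. r + gamma * max_a Q'(s', a)
    """
    B, T, V = actor_logits.shape

    # Critic: regress Q(s, a_taken) toward the TD targets.
    q_taken = critic_q.gather(-1, target_tokens.unsqueeze(-1)).squeeze(-1)   # (B, T)
    critic_loss = F.mse_loss(q_taken, td_targets)

    # Advantage of each dataset token under the current policy:
    # A(s, a) = Q(s, a) - V(s), with V(s) = E_{a ~ pi}[Q(s, a)].
    with torch.no_grad():
        probs = actor_logits.softmax(dim=-1)          # (B, T, V)
        v = (probs * critic_q).sum(dim=-1)            # (B, T)
        keep = (q_taken - v > adv_threshold).float()  # filter mask over tokens

    # Actor: supervised (cross-entropy) loss only on tokens the critic judges
    # at least as good as the current policy's expectation; the rest are masked out.
    ce = F.cross_entropy(actor_logits.reshape(-1, V), target_tokens.reshape(-1),
                         reduction="none").view(B, T)
    actor_loss = (keep * ce).sum() / keep.sum().clamp(min=1.0)

    return actor_loss + critic_coef * critic_loss
```

Setting the threshold to negative infinity recovers plain SFT, which is one way to read AFSFT as a drop-in replacement: the critic only removes gradient signal from dataset actions it estimates to be worse than what the current policy would already do.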

Experiment

The experiments evaluate VLMQ on two open-weight VLMs (MoonDream2 and xGen-MM) across three multimodal agent domains: Gym Cards (Blackjack and NumberLine tasks), BabyAI (gridworld navigation), and BrowserGym MiniWoB (27 browser click tasks). Datasets include offline collections gathered from random policies and from base VLM outputs, which are often noisy or suboptimal, with online data collection enabled in some settings. The evaluation measures action-syntax accuracy and task success rate, comparing VLMQ (AFSFT) against SFT and base-model prompting. Results show VLMQ significantly improves over SFT, especially on noisy datasets like MiniWoB, where MoonDream2's success rate approaches that of larger models like xLAM after fine-tuning (e.g., the median success rate across tasks rises from 0% to over 30%). In Gym Cards, VLMQ matches or exceeds the reference RL4VLM scores, and in BabyAI, online RL reaches near-100% success with exploration strategies. The experimental design is reasonably comprehensive, covering diverse tasks and data conditions, but lacks detailed analysis of failure cases or of robustness to varying dataset quality. The improvement is clear, though the reliance on specific prompting and parsing strategies may limit generalizability, and the stability of the offline-to-online transition is not deeply probed for potential distribution shift.

Further Thoughts

The VLM Q-Learning approach opens intriguing avenues for aligning multimodal models with specific tasks, particularly in domains where demonstration data is imperfect or scarce, which is common in real-world applications like web automation or robotics. The use of a critic head to filter actions is reminiscent of actor-critic methods in traditional RL, but its application to token-based action spaces in VLMs is novel and could inspire similar hybrid techniques for other generative models, such as those in audio or video domains. However, a deeper exploration of the critic’s calibration in large token vocabularies is needed, as miscalibrated Q-values could mislead the filtering process, especially in online settings where distribution shifts are frequent. Additionally, connecting this work to recent advancements in RLHF (Reinforcement Learning from Human Feedback) could provide insights into incorporating human preferences or safety constraints into VLM agents, potentially addressing ethical concerns in automated decision-making. Comparing VLMQ’s efficiency and scalability with proprietary models or exploring its integration with emergent abilities in larger foundation models could further validate its practical impact. Lastly, the reliance on LoRA for parameter efficiency raises questions about whether full fine-tuning on more powerful hardware could uncover additional performance gains or reveal limitations in the current approach.


