Selftok introduces a non-spatial, autoregressive visual tokenizer built on diffusion timesteps, enabling a unified vision-language model and effective reinforcement learning that yields superior text-to-image generation, as demonstrated on the GenEval and DPG-Bench benchmarks.
Autoregressive Modeling, Diffusion Model, Vision-Language Model, Reinforcement Learning, Image Generation, Multimodal Systems
Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li’an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang
Media Technology Institute, Huawei Singapore
Generated by grok-3
Background Problem
The research addresses the challenge of unifying multimodal data (vision and language) within a single discrete autoregressive (dAR) framework, motivated by the impending exhaustion of language data for large language models (LLMs) and the underutilization of visual data. Traditional spatial tokenization of images is incompatible with the causal AR structure of language models: it makes vision-language model (VLM) training inefficient, and its non-AR dependencies violate the optimality conditions of policy improvement, rendering reinforcement learning (RL) for visual tasks ineffective. The key problem solved is the creation of a non-spatial, AR-based visual tokenizer (Selftok) that aligns visual representation with language models, enabling seamless integration into dAR architectures and supporting effective RL for visual generation without additional training objectives.
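To make the causality argument concrete, the requirement can be written in standard AR and Bellman notation. This is an illustrative sketch for exposition only: the token symbols s_k, the policy π, and the terminal reward r are assumed here, not taken verbatim from the paper.

```latex
% Illustrative sketch, not the paper's exact formulation.
\begin{align*}
  p(s_1,\dots,s_K) &= \prod_{k=1}^{K} p(s_k \mid s_{<k})
  && \text{(causal AR factorization of the visual tokens)} \\
  V_k(s_{<k}) &= \mathbb{E}_{s_k \sim \pi(\cdot \mid s_{<k})}\big[\, V_{k+1}(s_{\le k}) \,\big],
  \qquad V_{K+1}(s_{1:K}) = r(s_{1:K})
  && \text{(Bellman recursion with a terminal reward)}
\end{align*}
```

Under this view, each token is an action and the prefix s_{<k} is the state; with spatial tokens, a token also depends on tokens that appear later in the sequence, so the prefix no longer summarizes the process and the policy-improvement guarantee does not apply.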
Method
Selftok, or Self-consistency Tokenizer, is a novel visual tokenizer that encodes images into discrete autoregressive (AR) tokens using the reverse diffusion process, abandoning spatial priors. Its core idea is to represent an image as a sequence of tokens aligned with diffusion timesteps, giving the sequence an AR causal structure akin to language. The implementation involves:
1) an encoder (a dual-stream transformer) that processes image latents together with token embeddings and outputs continuous token representations;
2) a quantizer that maps these to discrete tokens via vector quantization against a codebook;
3) a decoder (diffusion-based) that reconstructs images from the tokens, conditioned on diffusion timesteps;
4) a token schedule that aligns tokens with diffusion steps to enforce AR compliance; and
5) a one-step renderer for fast reconstruction after training.
The method optimizes a constrained objective that combines a reconstruction loss with an AR constraint, leveraging the diffusion recursion so that tokens follow a causal dependency, which theoretically supports RL via the Bellman equation.
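A minimal PyTorch sketch of this pipeline, assuming simplified components: a plain transformer encoder stands in for the dual-stream encoder, the diffusion decoder is reduced to a noise-prediction stub, and the one-step renderer is omitted. All module names, hyperparameters (e.g., 512 tokens, codebook size 32768), and the particular token schedule are illustrative placeholders rather than Selftok's actual implementation.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, codebook_size=32768, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):
        # z: (B, K, D) continuous token representations from the encoder.
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        ids = dist.argmin(dim=-1)                      # (B, K) discrete token ids
        z_q = self.codebook(ids)                       # (B, K, D) quantized vectors
        z_q = z + (z_q - z).detach()                   # straight-through estimator
        return z_q, ids


class SelftokSketch(nn.Module):
    """Toy stand-in for the pipeline: encoder -> quantizer -> diffusion decoder.
    Shapes, schedules, and module choices are illustrative assumptions."""

    def __init__(self, num_tokens=512, dim=256, latent_dim=16, timesteps=1000):
        super().__init__()
        self.num_tokens, self.timesteps = num_tokens, timesteps
        self.token_queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.latent_proj = nn.Linear(latent_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.quantizer = VectorQuantizer(dim=dim)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.noise_head = nn.Linear(dim, latent_dim)   # predicts diffusion noise

    def encode(self, latents):
        # latents: (B, N, latent_dim) image latents (e.g., from a frozen VAE).
        x = torch.cat([self.latent_proj(latents),
                       self.token_queries.expand(latents.size(0), -1, -1)], dim=1)
        h = self.encoder(x)[:, -self.num_tokens:]      # keep the token slots
        return self.quantizer(h)                       # (z_q, ids)

    def token_schedule(self, t):
        # Map timestep t (t = T-1 is pure noise) to how many leading tokens are
        # visible, so earlier tokens govern earlier (noisier) reverse steps.
        frac = 1.0 - t.float() / self.timesteps
        return (frac * self.num_tokens).long().clamp(min=1)

    def decode_step(self, noisy_latent, t, z_q):
        # One reverse-diffusion step conditioned on an AR prefix of the tokens.
        k = self.token_schedule(t)                                   # (B,)
        pos = torch.arange(self.num_tokens, device=z_q.device)[None, :]
        cond = z_q.masked_fill((pos >= k[:, None]).unsqueeze(-1), 0.0)
        h = self.decoder(torch.cat([self.latent_proj(noisy_latent), cond], dim=1))
        return self.noise_head(h[:, :noisy_latent.size(1)])          # predicted noise


# Toy usage: 2 images as 64 latent patches each; ids is the AR token sequence
# that a dAR language model would consume alongside text tokens.
model = SelftokSketch()
z_q, ids = model.encode(torch.randn(2, 64, 16))
eps_hat = model.decode_step(torch.randn(2, 64, 16), torch.tensor([999, 500]), z_q)
```

The masking in decode_step is one simple way to realize the token schedule; the key property it mimics is that reconstruction at a given diffusion step can only depend on tokens earlier in the sequence, preserving the AR structure.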
Experiment
The experiments evaluate Selftok across tokenization quality, VLM performance, and visual RL effectiveness. Datasets include ImageNet-1k for tokenizer training and validation, plus GenEval and DPG-Bench for text-to-image generation, PIE-Bench for image editing, and MME for vision-language comprehension. The setup is comprehensive, covering reconstruction metrics (rFID, PSNR, SSIM, LPIPS), generation alignment scores, and comprehension accuracy, with ablations on token count, codebook size, and sampling strategies. Results show Selftok achieves state-of-the-art reconstruction (e.g., PSNR 26.30 with 1024 tokens vs. FlowMo-Hi's 24.93) and leading post-RL text-to-image scores (GenEval: 92 vs. HiDream-I1's 83; DPG-Bench: 85.57 vs. SD3-Medium's 84.08), with large RL gains (e.g., +18 on GenEval from Selftok-SFT to Selftok-Zero). However, comprehension still trails dedicated VLMs (MME: 1381.3 vs. LLaVA's 1532.1), and generation speed remains a bottleneck (slower than diffusion models). The setup is reasonable but lacks a direct RL comparison against diffusion-based methods and broader task diversity, raising questions about generalizability. The results support the expectation that AR tokens enhance RL, but the claims of superiority overstate the case without addressing practical deployment challenges.
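For reference, PSNR (one of the reconstruction metrics quoted above) is a fixed formula over the mean squared error between the original image and its reconstruction; a minimal sketch:

```python
import torch


def psnr(x, x_rec, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a reference image x and its
    reconstruction x_rec, both scaled to [0, max_val]; higher is better."""
    mse = torch.mean((x - x_rec) ** 2)
    return 10.0 * torch.log10((max_val ** 2) / mse)
```

rFID, SSIM, and LPIPS additionally depend on reference statistics or learned feature extractors, so they are typically computed with dedicated libraries rather than a few lines.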
Further Thoughts
Reflecting on Selftok's approach, I find the integration of diffusion and autoregressive modeling compelling, particularly for RL in visual generation, since it bridges a gap in multimodal learning where traditional spatial tokenization falls short. I am also intrigued by potential parallels with recent advances in video generation, where temporal causality (akin to the AR structure) is critical: could Selftok's principles extend to video tokens with physics-aware rewards, as hinted in the authors' ongoing work? This connects to broader research on world models (e.g., Sora's simulation capabilities), where understanding physical laws via tokenization could be transformative; yet Selftok's current resolution and speed limitations suggest a need for hybrid approaches with spatial compression. Additionally, the reliance on program-based rewards for the RL gains raises concerns about scalability to less structured tasks: could integrating human-in-the-loop feedback or unsupervised reward models (as explored in LLM alignment) offer a more robust solution? Finally, the synergy between comprehension and generation shown in Selftok's ablations hints at a deeper interplay worth exploring in the context of emergent multimodal abilities, potentially linking to scaling-laws research on foundation models. These directions warrant further investigation to validate Selftok's broader impact.