arXiv: 2503.01739

VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Published:  at  11:10 AM

This paper introduces VideoUFO, a million-scale dataset of 1.09 million video clips spanning 1,291 user-focused topics for text-to-video generation. The clips are curated from YouTube with minimal overlap with existing datasets, and training a simple model (MVDiT) on VideoUFO improves performance on the topics where current models perform worst.

Generative AI, Text-to-Video, Dataset, Video Generation, Multimodal Data, Benchmark

Wenhao Wang, Yi Yang

University of Technology Sydney, Zhejiang University

Generated by grok-3

Background Problem

Text-to-video generative models, despite their potential in creative and practical applications such as film production and education, often fail to meet real-world user expectations because their training data is not aligned with the specific topics users actually care about. The paper introduces VideoUFO, a dataset designed to bridge this gap by curating over 1.09 million video clips based on 1,291 topics derived from real user prompts. The goal is to improve model performance in underrepresented or niche areas where current models struggle, such as generating accurate depictions of specific concepts like ‘glowing fireflies’.

Method

VideoUFO’s curation involves a multi-step process:

(1) Analyzing user-focused topics: embedding 1.67 million prompts from the VidProM dataset into vectors with SentenceTransformers, clustering them with K-means into 2,000 clusters, merging similar clusters, and naming the topics with GPT-4o, resulting in 1,291 distinct topics (a minimal sketch of this step appears after the list).

(2) Collecting videos via YouTube’s official API for each topic, requiring Creative Commons licensing, high resolution (720p+), and short duration (under 4 minutes), yielding 586,490 videos.

(3) Segmenting the videos into 3.18 million semantically consistent clips using the shot boundary detection and stitching methods from Panda-70M.

(4) Generating brief and detailed captions for the clips using models from Panda-70M and Open-Sora-Plan (Qwen2-VL-7B).

(5) Verifying each clip’s relevance to its topic with GPT-4o mini applied to the detailed captions, reducing the set to 1.09 million clips.

(6) Assessing video quality with VBench metrics such as subject consistency and motion smoothness.

This method prioritizes user alignment and data novelty, with minimal overlap (0.29%) with existing datasets.
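As a rough illustration, here is a minimal sketch of the topic-discovery step (1), assuming the all-MiniLM-L6-v2 sentence encoder and a simple centroid-similarity threshold for merging clusters; the paper’s exact embedding model, merging rule, and GPT-4o prompting are not specified here, so these choices are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def discover_topics(prompts, n_clusters=2000, merge_threshold=0.9):
    # 1. Embed each user prompt into a dense vector (assumed encoder).
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(prompts, normalize_embeddings=True)

    # 2. Cluster the prompt embeddings with K-means.
    kmeans = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
    labels = kmeans.fit_predict(embeddings)

    # 3. Merge clusters whose centroids are nearly identical
    #    (a simple union-find over highly similar centroid pairs;
    #    the paper's actual merging rule may differ).
    centroids = kmeans.cluster_centers_
    sims = cosine_similarity(centroids)
    parent = list(range(n_clusters))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_clusters):
        for j in range(i + 1, n_clusters):
            if sims[i, j] >= merge_threshold:
                parent[find(i)] = find(j)

    # Group prompts by their merged cluster root; each surviving cluster
    # would then be named by an LLM (GPT-4o in the paper) to yield the
    # final list of user-focused topics.
    roots = {find(labels[k]) for k in range(len(prompts))}
    return {root: [prompts[k] for k in range(len(prompts))
                   if find(labels[k]) == root]
            for root in roots}
```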

Experiment

The experiments evaluate 16 existing text-to-video models and a new model (MVDiT) trained on VideoUFO using the BenchUFO benchmark. BenchUFO tests performance on 791 concrete-noun topics, each with 10 user prompts drawn from VidProM: the model generates a video for each prompt, Qwen2-VL-7B describes the generated video, and the score is the cosine similarity between the prompt and the description. The setup focuses on the worst- and best-performing topics to highlight gaps; abstract nouns are excluded for clarity, and a robust video understanding model is used for the descriptions. Results show that current models are inconsistent, with score differences of 0.233-0.314 between their top-10 and low-10 topics, and they often fail on niche topics like ‘giant squid’. MVDiT trained on VideoUFO achieves a 4.2% improvement on the low-10 topics (a score of 0.442 vs. 0.400 for the state of the art) while maintaining top-10 performance, outperforming models trained on similar-scale datasets such as OpenVid-1M. However, limited computational resources (32 A100 GPUs) constrain the exploration of VideoUFO’s full potential, and the similarity-based evaluation may not fully capture visual quality or semantic depth. The results partially meet the expectation of improving worst-case performance but suggest room for deeper validation with larger-scale training.
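For intuition, below is a minimal sketch of a BenchUFO-style scoring loop; `generate_video` and `describe_video` are hypothetical stand-ins for the text-to-video model under test and the Qwen2-VL-7B video captioner, and the all-MiniLM-L6-v2 text encoder is an assumed choice for computing the prompt-description similarity.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def benchufo_scores(topic_prompts, generate_video, describe_video):
    """topic_prompts: dict mapping topic -> list of user prompts (10 per topic)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed text encoder
    per_topic = {}
    for topic, prompts in topic_prompts.items():
        # Generate a video per prompt, then caption it with a video-understanding model.
        descriptions = [describe_video(generate_video(p)) for p in prompts]
        p_emb = encoder.encode(prompts, normalize_embeddings=True)
        d_emb = encoder.encode(descriptions, normalize_embeddings=True)
        # Cosine similarity between each prompt and its video's description,
        # averaged over the topic's prompts.
        per_topic[topic] = float(np.mean(np.sum(p_emb * d_emb, axis=1)))

    # Report the mean score over the best and worst ten topics, and their gap.
    ranked = sorted(per_topic.values(), reverse=True)
    top10, low10 = float(np.mean(ranked[:10])), float(np.mean(ranked[-10:]))
    return per_topic, top10, low10, top10 - low10
```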

Further Thoughts

The introduction of VideoUFO opens intriguing avenues for text-to-video generation, particularly in aligning models with user needs, but it also prompts deeper questions about dataset scalability and adaptability. Could the methodology of topic extraction and video curation be applied to other generative domains, such as text-to-audio or image-to-video, to address similar user alignment issues? The reliance on YouTube and Creative Commons licensing, while legally sound, might limit diversity if certain topics are underrepresented on the platform; future work could explore integrating data from other sources like TikTok or user-generated content platforms with appropriate ethical safeguards.

Additionally, the BenchUFO benchmark’s focus on cosine similarity might undervalue aesthetic or contextual nuances in video generation; integrating human evaluation or advanced metrics like CLIP-based alignment could provide a more holistic assessment. I’m also curious about the temporal aspect: user interests evolve, as hinted in the paper’s extension section. Linking this work to continual learning paradigms could ensure datasets like VideoUFO remain relevant over time, perhaps by dynamically updating topics and videos based on real-time user prompt trends from platforms like social media or generative AI tools.

Finally, connecting this to AI ethics, the emphasis on user focus could inspire frameworks for personalized generative AI that prioritize fairness and inclusivity in representing diverse cultural or regional user interests, potentially mitigating biases inherent in broader, open-domain datasets.


