Tag: Dataset
All the articles with the tag "Dataset".
-
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
This paper introduces VideoUFO, a dataset of 1.09 million video clips spanning 1,291 user-focused topics for text-to-video generation. Curated from YouTube with minimal overlap with existing datasets, it improves performance on the worst-performing topics when used to train a simple model such as MVDiT.
-
Merge to Mix: Mixing Datasets via Model Merging
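This paper proposes *Merge to Mix*, a method that uses model merging as a proxy to efficiently select dataset mixtures for fine-tuning large models. On image classification and language tasks it significantly outperforms traditional methods, approaching and in some cases exceeding Oracle performance.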
-
Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M
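This paper uses a prompt-based approach to conduct a preliminary study of how much large language models (LLMs) have memorized the MovieLens-1M recommendation dataset. It finds that all tested models show some degree of memorization and that the degree of memorization correlates positively with both recommendation performance and model size. It also reveals a popularity bias problem.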
-
HAIR: Hardness-Aware Inverse Reinforcement Learning with Introspective Reasoning for LLM Alignment
HAIR is an LLM alignment method that combines hardness-aware inverse reinforcement learning with introspective reasoning. It constructs a balanced safety dataset and trains category-specific reward models with GRPO-S, achieving state-of-the-art harmlessness while preserving usefulness across multiple benchmarks.