Tag: Dataset

All the articles with the tag "Dataset".

VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Published: 16 May, 2025 at 11:10 AM

94.40 🤔

This paper introduces VideoUFO, a dataset of 1.09 million video clips spanning 1,291 user-focused topics for text-to-video generation. The clips are curated from YouTube with minimal overlap with existing datasets, and training a simple model (MVDiT) on them improves performance on the worst-performing topics.

Merge to Mix: Mixing Datasets via Model Merging

Published: 26 May, 2025 at 11:24 AM

87.71 🤔

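This paper proposes *Merge to Mix*, a method that uses model merging as a proxy to efficiently select dataset mixtures for fine-tuning large models. On image classification and language tasks it significantly outperforms conventional methods, approaching and in some cases exceeding Oracle performance.
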
Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M

Published: 19 May, 2025 at 11:16 AM

83.59 🤔

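Using a prompt-based approach, this paper presents a preliminary study of how much large language models (LLMs) memorize the MovieLens-1M recommendation dataset. All tested models show some degree of memorization, and the degree of memorization correlates positively with both recommendation performance and model size. The study also reveals a popularity bias.
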
HAIR: Hardness-Aware Inverse Reinforcement Learning with Introspective Reasoning for LLM Alignment

Published: 11 May, 2025 at 11:12 AM

67.37 🤔

HAIR is an LLM alignment method that combines hardness-aware inverse reinforcement learning with introspective reasoning. It constructs a balanced safety dataset and trains category-specific reward models with GRPO-S, achieving state-of-the-art harmlessness while preserving usefulness across multiple benchmarks.
