Tag: Benchmark

All the articles with the tag "Benchmark".

VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Published: 16 May, 2025 at 11:10 AM

94.40 🤔

This paper introduces VideoUFO, a dataset of 1.09 million video clips spanning 1,291 user-focused topics for text-to-video generation. The clips are curated from YouTube and overlap minimally with existing datasets, and training a simple model such as MVDiT on VideoUFO improves performance on the worst-performing topics.
LIFEBench: Evaluating Length Instruction Following in Large Language Models

Published: 25 May, 2025 at 11:47 AM

88.64 🤔

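This paper introduces the LIFEBench benchmark and uses it to systematically evaluate how well 26 large language models follow length instructions. It finds that the models generally perform poorly under long length constraints and fall far short of the maximum output lengths claimed by their vendors, revealing fundamental limitations in length awareness and long-text generation.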
Humanity's Last Exam

Published: 4 May, 2025 at 04:28 PM

58.39 👍

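This paper introduces the Humanity's Last Exam benchmark, which uses challenging, expert-written multimodal questions to address the saturation of existing LLM benchmarks and to evaluate model capabilities on closed-ended academic tasks.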