Tag: Multimodal Data

All the articles with the tag "Multimodal Data".

Thermal Detection of People with Mobility Restrictions for Barrier Reduction at Traffic Lights Controlled Intersections

Published: 15 May, 2025 at 11:04 AM

96.91 🤔

This paper introduces a traffic light system built on a thermal detector, YOLO-Thermal (a modified YOLOv8 framework), that dynamically adjusts signal timings for people with mobility restrictions. It reaches 89.1% validation AP and improves intersection accessibility while addressing privacy concerns and adverse conditions.

PICD: Versatile Perceptual Image Compression with Diffusion Rendering

Published: 15 May, 2025 at 11:10 AM

95.81 🤔

PICD is a versatile perceptual image compression codec that uses diffusion rendering with three-tiered conditioning to achieve high text accuracy and visual quality on both screen and natural images, outperforming existing methods on key metrics such as FID and text accuracy.

VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Published: 16 May, 2025 at 11:10 AM

94.40 🤔

This paper introduces VideoUFO, a dataset of 1.09 million video clips spanning 1,291 user-focused topics for text-to-video generation. The clips are curated from YouTube with minimal overlap with existing datasets, and training a simple model such as MVDiT on them improves performance on the worst-performing topics.

Gameplay Highlights Generation

Published: 14 May, 2025 at 11:06 AM

92.19 🤔

This paper presents a method for generating gameplay highlight reels by fine-tuning the X-CLIP multimodal model on an in-house FPS game dataset. It achieves over 90% event detection accuracy, demonstrates transfer learning, and uses quantization to optimize deployment.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Published: 3 Jun, 2025 at 11:45 AM

91.52 🤔

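This paper proposes ProRL, which combines prolonged reinforcement learning with a KL divergence penalty and reference policy resets to train the Nemotron-Research-Reasoning-Qwen-1.5B model on a diverse set of tasks. The approach significantly expands the reasoning boundaries of large language models, especially in domains where the base model performs poorly and on out-of-distribution tasks.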
