Datasets

For Open-Sora 1.2, we conduct mixed training with both images and videos. The main datasets we use are listed below. Please refer to the README for data processing.

Video

Webvid-10M

Webvid-10M contains 10 million video-text pairs scraped from stock footage sites. We first train the model on this dataset (40k hours) for 30k steps (2 epochs).
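As a rough sanity check on these figures, the sketch below derives the average clip length and the implied number of samples consumed per training step. It uses only the numbers quoted above; the function names are illustrative, and the per-step figure is an approximation that assumes every video is seen once per epoch, which may not hold exactly with the actual data pipeline.

```python
# Back-of-the-envelope figures derived only from the numbers quoted above.
NUM_VIDEOS = 10_000_000   # "10 million video-text pairs"
TOTAL_HOURS = 40_000      # "40k hours"
TRAIN_STEPS = 30_000      # "30k steps"
EPOCHS = 2                # "2 epochs"


def avg_clip_seconds(total_hours, num_videos):
    """Average clip duration in seconds."""
    return total_hours * 3600 / num_videos


def samples_per_step(num_videos, epochs, steps):
    """Implied samples per step, assuming each video is seen once per epoch."""
    return num_videos * epochs / steps


if __name__ == "__main__":
    print(f"average clip length: {avg_clip_seconds(TOTAL_HOURS, NUM_VIDEOS):.1f} s")
    print(f"implied samples per step: {samples_per_step(NUM_VIDEOS, EPOCHS, TRAIN_STEPS):.0f}")
```

Under these assumptions the clips average about 14.4 seconds, and each step consumes roughly 667 samples.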

Panda-70M

Panda-70M is a large-scale dataset with 70M video-caption pairs. We use the training-10M subset for training, which contains ~10M videos of better quality.

Mixkit

Mixkit is a video website from which we obtained 9k videos.

Pixabay

Pixabay is a video website from which we obtained 60.5k videos.

Pexels

Pexels is a popular online platform that provides high-quality stock photos, videos, and music for free. Most videos from this website are of high quality. Thus, we use them for both pre-training and HQ fine-tuning. We really appreciate the great platform and the contributors!

Inter4K

Inter4K is a dataset containing 1K video clips at 4K resolution. The dataset was proposed for super-resolution tasks. We use it for HQ fine-tuning.

HD-VG-130M

HD-VG-130M comprises 130M text-video pairs. The captions are generated by BLIP-2. We find that both the scene quality and the text quality are relatively poor. For Open-Sora 1.0, we only use ~350K samples from this dataset.

MiraData

MiraData is a high-quality dataset with 77k long videos, mainly from games and city/scenic exploration.

Vript

Vript is a densely annotated dataset of 400k videos.

Image

Midjourney-v5-1.7M

Midjourney-v5-1.7M includes 1.7M image-text pairs, split into two subsets: original and upscale. The dataset was proposed for exploring the relationship between prompts and high-quality images.

Midjourney-kaggle-clean

Midjourney-kaggle-clean is a reconstructed version of Midjourney User Prompts & Generated Images (250k), cleaned by rules. It is also divided into two subsets: original and upscale. The dataset was proposed to enable research on text-to-image model prompting.

Unsplash-lite

The Unsplash-lite Dataset comprises 25k nature-themed Unsplash photos, 25k keywords, and 1M searches. This dataset covers a vast range of uses and contexts. Its extensive scope in intent and semantics opens new avenues for research and learning.

LAION-AESTHETICS 6.5+

The LAION aesthetic 6.5+ dataset is a subset of the LAION dataset containing 625K high-quality images with aesthetic scores > 6.5. However, as LAION is currently not publicly available, we use this 168k subset.
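To illustrate what an aesthetic-score cutoff of this kind looks like, here is a minimal sketch that keeps only records scoring above 6.5. The record layout, the `aesthetic_score` field name, and the `filter_by_aesthetic` helper are assumptions made for illustration; they are not the LAION metadata schema or the filtering code used for this project.

```python
def filter_by_aesthetic(records, threshold=6.5, key="aesthetic_score"):
    """Keep records whose score is strictly above the threshold.

    `records` is a list of dicts; `key` is a hypothetical field name.
    Records that lack a score are dropped rather than guessed at.
    """
    return [r for r in records if r.get(key) is not None and r[key] > threshold]


if __name__ == "__main__":
    # Toy metadata, invented purely for this example.
    sample = [
        {"url": "a.jpg", "aesthetic_score": 6.8},
        {"url": "b.jpg", "aesthetic_score": 6.5},  # not strictly above 6.5
        {"url": "c.jpg", "aesthetic_score": 4.2},
        {"url": "d.jpg"},                          # missing score
    ]
    kept = filter_by_aesthetic(sample)
    print([r["url"] for r in kept])
```

The comparison is strict to match the "> 6.5" wording above, so an image scoring exactly 6.5 is excluded.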