List of datasets relevant for the Duke Machine Learning Summer School 2022

MNIST (Modified National Institute of Standards and Technology), which contains images of handwritten digits. MNIST is available at http://yann.lecun.com/exdb/mnist/, and we're using all 4 of the files listed at the top of that page: the training set and the test set, each with its images and its labels.
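The MNIST files are distributed in the IDX binary format described on the MNIST page: a four-byte header (two zero bytes, a type code, and the number of dimensions), then one big-endian 32-bit size per dimension, then the raw values. The following is a minimal sketch of a reader for the unsigned-byte case, using only the standard library. The function names `parse_idx` and `read_idx_gz` are hypothetical, and the local file path in the usage note is an assumption about where you saved the download.

```python
import gzip
import struct


def parse_idx(raw):
    """Parse an uncompressed IDX buffer into (shape, payload_bytes)."""
    # Header: two zero bytes, a type code (0x08 = unsigned byte), and the dimension count.
    zero, type_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    payload = raw[4 + 4 * ndim:]
    expected = 1
    for dim in shape:
        expected *= dim
    if len(payload) != expected:
        raise ValueError("payload length does not match the header")
    return shape, payload


def read_idx_gz(path):
    """Read a gzip-compressed IDX file from disk (path is an assumed local download)."""
    with gzip.open(path, "rb") as f:
        return parse_idx(f.read())
```

For example, assuming the training images were saved locally as `train-images-idx3-ubyte.gz`, `read_idx_gz("train-images-idx3-ubyte.gz")` should return a three-element shape (image count, rows, columns) together with the pixel bytes.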

CIFAR-10, a set of images in 10 different classes (CIFAR stands for Canadian Institute For Advanced Research). We are using the “CIFAR-10 python version” from https://www.cs.toronto.edu/~kriz/cifar.html
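The python version of CIFAR-10 ships as pickled batch files, and the CIFAR page describes each batch as a dictionary whose image rows hold 3072 values: 1024 red, then 1024 green, then 1024 blue. A minimal sketch of reading one batch might look like the following. The helper names `load_batch` and `split_channels` are hypothetical, and unpickling the real batches is assumed to require NumPy to be installed, since the image data is stored as a NumPy array. As with any pickle, only load files you obtained from the official source.

```python
import pickle


def load_batch(path):
    """Unpickle one CIFAR-10 batch file into a dict with bytes keys."""
    with open(path, "rb") as f:
        # encoding="bytes" is needed because the batches were pickled under Python 2.
        return pickle.load(f, encoding="bytes")


def split_channels(row):
    """Split one 3072-value image row into its red, green, and blue planes."""
    if len(row) != 3072:
        raise ValueError("expected 3072 values per image")
    return row[:1024], row[1024:2048], row[2048:]
```

With a batch loaded, `batch[b"data"][0]` is one image row and `batch[b"labels"][0]` is its class index; each plane returned by `split_channels` is a 32x32 image stored row by row.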

COCO (Common Objects in Context). We are using 5 of the datasets from https://cocodataset.org/#download: 2017 Train images [118K/18GB], 2017 Val images [5K/1GB], 2017 Test images [41K/6GB], 2017 Train/Val annotations [241MB], and 2017 Stuff Train/Val annotations [1.1GB]
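The COCO annotation archives contain JSON files, so a quick sanity check after downloading does not need any COCO-specific tooling. The sketch below counts annotations per category using only the standard library. It assumes the documented COCO layout, in which the top-level object has `categories` (each with an `id` and a `name`) and `annotations` (each with a `category_id`). The function names and the file path in the usage note are hypothetical.

```python
import json
from collections import Counter


def load_annotations(path):
    """Load one COCO annotation JSON file from an assumed local path."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_by_category(coco):
    """Count annotations per category name in a parsed COCO annotation dict."""
    # Map numeric category ids to readable names.
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])
```

For example, assuming the Train/Val annotations archive was extracted into an `annotations` folder, `count_by_category(load_annotations("annotations/instances_val2017.json")).most_common(5)` would list the five most frequently annotated categories in the validation set.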

IMDb (Internet Movie Database) datasets, which provide an excellent basis for NLP modeling activities around sentiment analysis. The datasets are available at https://datasets.imdbws.com/ and documentation is at https://www.imdb.com/interfaces/

CelebA (CelebFaces Attributes), for use in generative model development: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

ImageNet, which provides an excellent basis for image analysis activities: https://www.image-net.org/

Going directly to the source dataset is helpful because it lets you keep track of each dataset's provenance; it's also more efficient because these datasets can be huge and expensive to upload, download, and store. That's why we include the links in the notebooks and GitHub repository, not the datasets themselves.
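One lightweight way to keep track of provenance for files you download yourself is to record each file's source URL next to a checksum of its contents. The following is a minimal sketch of that idea, not something the course repository is known to include; the function names and the manifest layout are hypothetical.

```python
import hashlib
import json


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_entry(name, source_url, path):
    """Describe one downloaded file: where it came from and what its contents hash to."""
    return {"name": name, "source_url": source_url, "sha256": sha256_of(path)}


def write_manifest(entries, manifest_path):
    """Save the provenance entries as a small JSON manifest."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)
```

A manifest like this is small enough to commit alongside the notebooks, and re-running `sha256_of` later shows whether a local copy still matches what was originally downloaded.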