List of datasets relevant for the Duke Machine Learning Summer School 2022
MNIST (Modified National Institute of Standards and Technology), which contains images of handwritten digits. MNIST is available at http://yann.lecun.com/exdb/mnist/, and we're using all four files listed at the top of that page: the images and labels for both the training set and the test set.
CIFAR-10, a set of 60,000 32x32 color images in 10 different classes. (CIFAR stands for Canadian Institute For Advanced Research.) We are using the “CIFAR-10 python version” from https://www.cs.toronto.edu/~kriz/cifar.html
COCO (Common Objects in Context). We are using five of the datasets from https://cocodataset.org/#download: 2017 Train images [118K/18GB]; 2017 Val images [5K/1GB]; 2017 Test images [41K/6GB]; 2017 Train/Val annotations [241MB]; and 2017 Stuff Train/Val annotations [1.1GB].
IMDb (Internet Movie Database) datasets, which provide an excellent basis for NLP modeling activities around sentiment analysis. The datasets are available at https://datasets.imdbws.com/ and the documentation is at https://www.imdb.com/interfaces/
CelebA (CelebFaces Attributes), which is used for generative model development: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
ImageNet, which provides an excellent basis for image analysis activities: https://www.image-net.org/
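As an illustration of working with a source file directly, the four MNIST files listed above are distributed as gzipped IDX files: a 4-byte header (two zero bytes, a data-type code, and the number of dimensions), then one big-endian 32-bit size per dimension, then the raw values. The sketch below is one possible reader, not the course's own loader. It assumes the file has already been downloaded locally, handles only the unsigned-byte type that MNIST uses, and the function name read_idx is made up for this example.

```python
import gzip
import struct


def read_idx(path):
    """Return (shape, data) for an unsigned-byte IDX file, gzipped or plain.

    read_idx is a hypothetical helper name, not part of any library.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        # Header: two zero bytes, a type code (0x08 = unsigned byte), dimension count.
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError("not an unsigned-byte IDX file")
        # One big-endian unsigned 32-bit integer per dimension.
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = f.read()

    # Check that the payload length matches the sizes declared in the header.
    expected = 1
    for size in shape:
        expected *= size
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, found {len(data)}")
    return shape, data
```

For a training image file, the returned shape would hold the image count followed by the row and column counts, and data would hold the pixel bytes in row-major order.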
Going directly to the source is helpful because it lets you keep track of each dataset's provenance. It's also more efficient: these datasets can be huge, which makes them expensive to upload, download, and store. That's why we include the links in the notebooks and GitHub repository rather than the datasets themselves.
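One way to make that provenance concrete is to keep a small manifest next to the links, recording where each downloaded file came from and a checksum of its contents. The sketch below is an illustration under stated assumptions, not something the course repository is known to contain: the function names, the manifest field names, and the datasets_manifest.json filename are all made up for this example, and it only inspects files that have already been downloaded.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a multi-gigabyte archive never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_entry(source_url, local_path):
    """Describe one downloaded file: its origin, name, size, and checksum."""
    local_path = Path(local_path)
    return {
        "source_url": source_url,  # the page or file URL the data came from
        "filename": local_path.name,
        "size_bytes": local_path.stat().st_size,
        "sha256": sha256_of_file(local_path),
    }


def write_manifest(entries, manifest_path="datasets_manifest.json"):
    """Save the entries as JSON; the default filename is a hypothetical choice."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)
```

A manifest like this is small enough to commit to the repository, and re-running sha256_of_file on a fresh download shows whether the file still matches the one originally used.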