Datasets

greglu edited this page Jun 24, 2011 · 12 revisions

General Information

The datasets listed below are all preloaded into the HackReduce Hadoop clusters and ready for immediate use at the event. The [datasets/*] notice next to each title indicates the path where the dataset is located, depending on where you want to access it:

  • Hadoop HDFS: Can be found at /datasets/*
  • Namenode local filesystem: Can be found at /mnt/datasets/*
  • HackReduce GitHub project: Samples can be found in the datasets/* folder of the project. Note: not all the datasets listed on this page have samples in the GitHub project.
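
The path convention above can be sketched in a few lines of Python. This is only an illustration: the function name `dataset_paths` and the dictionary keys are made up for this example and are not part of the HackReduce project. Only the three root locations come from the list above.

```python
from typing import Dict

# Root locations as described in the list above.
HDFS_ROOT = "/datasets"        # Hadoop HDFS
LOCAL_ROOT = "/mnt/datasets"   # Namenode local filesystem
PROJECT_ROOT = "datasets"      # folder inside the HackReduce GitHub project


def dataset_paths(marker: str) -> Dict[str, str]:
    """Expand a marker such as '[datasets/msd]' into its three locations.

    Hypothetical helper: the name and the returned keys are illustrative.
    """
    # Drop surrounding brackets, whitespace and slashes.
    name = marker.strip().strip("[]").strip("/")
    # Drop the leading 'datasets/' component to get the bare dataset name.
    if name.startswith("datasets/"):
        name = name[len("datasets/"):]
    if not name or name == "datasets":
        raise ValueError(f"no dataset name in marker: {marker!r}")
    return {
        "hdfs": f"{HDFS_ROOT}/{name}",
        "namenode_local": f"{LOCAL_ROOT}/{name}",
        "github_sample": f"{PROJECT_ROOT}/{name}",
    }


if __name__ == "__main__":
    for location, path in dataset_paths("[datasets/msd]").items():
        print(f"{location}: {path}")
```

Keep in mind that the GitHub sample path may not exist for every dataset, as noted above.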

There's also the possibility of loading new data at the event, but this process could take a few hours. Please see one of the Hopper event organizers (probably Greg) about loading new data into your clusters.

Million Song Dataset [datasets/msd]

Special thanks to Echo Nest for converting the whole dataset (200+ GB in HDF5 format) to TSV for us.

Freebase [datasets/freebase]

NASDAQ daily prices and dividends [datasets/nasdaq]

NYSE daily prices and dividends [datasets/nyse]

Wikipedia XML dump [datasets/wikipedia]

Google Ngram [datasets/ngrams]

Geonames [datasets/geonames]

Reddit voting data [datasets/reddit]

Bixi Montreal [datasets/bixi]

  • XML dump of all the bike station information queried every minute over a couple of months.
  • Provided by Fabrice (http://twitter.com/f8full)

DNS dataset [datasets/dns]

  • Contains the root file with all the domain names and their associated nameservers for the "com" TLD.

LDEO Surface Ocean CO2 Climatology data [datasets/ldeo]

Twitter dataset [datasets/twitter]

Flight dataset [datasets/flights]

  • Limited set of flight data containing origin, destination, departure time, return time, price and date.
  • Only has flights originating from SEA
  • Provided by Hopper

Amazon dataset [datasets/amazon]

IMDB dataset [datasets/imdb]

Taylor Tweets [datasets/taylor_tweets]

  • Collected around the time of Elizabeth Taylor's death in late March 2011, this dataset is the result of a search for all tweets containing the word "taylor".
  • JSON format
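
As a rough starting point for working with this dataset, here is a minimal Python sketch that filters tweets by keyword. It rests on two assumptions that this page does not confirm: that the files hold one JSON object per line, and that each tweet stores its body in a `text` field. Check a sample file before relying on either. The function name `tweets_mentioning` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def tweets_mentioning(lines: Iterable[str], word: str = "taylor") -> Iterator[dict]:
    """Yield parsed tweets whose text contains `word`, ignoring case.

    Assumes one JSON object per line with a 'text' field (unverified).
    """
    needle = word.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(tweet, dict):
            continue  # skip JSON values that are not objects
        text = tweet.get("text", "")
        if isinstance(text, str) and needle in text.lower():
            yield tweet


if __name__ == "__main__":
    sample = [
        '{"text": "RIP Elizabeth Taylor"}',
        '{"text": "unrelated tweet"}',
        "not json at all",
    ]
    for tweet in tweets_mentioning(sample):
        print(tweet["text"])
```

The same per-line filtering logic would fit naturally into the map step of a Hadoop job, since each line can be processed independently.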

Citation networks [datasets/citation-networks]