Movielens Dataset

This folder contains scripts and tasks to import the movies-100k data set, which contains 100,000 ratings from nearly 1,000 users for about 1,700 different movies, all part of the Movielens.org website. It also allows you to download and import the movies-1M data set with 1 million ratings from 6,000 users on 4,000 movies.

The repository should be available at /vagrant inside the VM. First connect to the VM and set up the data set.

To connect to the VM, run:

$ vagrant ssh

The following commands are run inside the VM.

$ cd /vagrant/dataset/movies-100k
$ gem install bundler
$ bundle install

This installs Bundler first, then all required gem dependencies.

Inside the folder dataset/movies-100k there is a Rakefile that provides a number of tasks. To display a list of all available rake tasks, run:

$ bundle exec rake -T
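
The exact listing depends on the Rakefile, but it should at least contain the tasks used in the remainder of this guide, roughly along these lines (the descriptions shown here are only illustrative):

rake create_100k_data_set         # Download and transform the movies-100k data set
rake create_1m_data_set           # Download and transform the movies-1M data set
rake upload_bulk                  # Bulk upload the generated seed file to Elasticsearch
rake es:movies:create[url,index]  # Create the index and apply the movies template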

Dataset

Go to the data set folder and run the following commands to upload the data set to Elasticsearch.

$ cd /vagrant/dataset/movies-100k

First we create an Elasticsearch index to store the data set and define the mappings for all types. We use the elasticsearch-rake-tasks gem and run the following command:

$ bundle exec rake es:movies:create[http://localhost:9200,movies]

This creates a new index named "movies" on the local Elasticsearch instance and applies the index template of the same name.
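
For illustration, such a template could define the mappings roughly as follows, using the legacy template format with one mapping per type. This is only a sketch; the actual template shipped with the repository, its field names, types and analyzers may differ:

{
  "template": "movies",
  "mappings": {
    "movie": {
      "properties": {
        "title":  { "type": "string" },
        "genres": { "type": "string", "index": "not_analyzed" },
        "year":   { "type": "integer" }
      }
    },
    "rating": {
      "properties": {
        "user_id":   { "type": "integer" },
        "movie_id":  { "type": "integer" },
        "rating":    { "type": "integer" },
        "timestamp": { "type": "date" }
      }
    }
  }
}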

Then run the rake task for the movies-100k data set:

$ bundle exec rake create_100k_data_set

For the data set containing 1M ratings, use:

$ bundle exec rake create_1m_data_set

This first downloads the movies-100k / 1M data set to a tmp folder, then extracts and transforms all users, genres, movies and ratings from the data set and creates a JSON file compatible with the Elasticsearch Bulk API.
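
The Bulk API format is newline-delimited JSON: an action line naming index, type and id, followed by a line with the document source. The generated seed file therefore looks roughly like this (the concrete field names and values are illustrative assumptions, not taken from the actual Rakefile):

{"index": {"_index": "movies", "_type": "movie", "_id": "1"}}
{"title": "Toy Story (1995)", "genres": ["Animation", "Children's", "Comedy"]}
{"index": {"_index": "movies", "_type": "rating", "_id": "1-1"}}
{"user_id": 1, "movie_id": 1, "rating": 5, "timestamp": 881250949}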

The last step is to bulk upload the generated seed file to Elasticsearch, which is done by:

$ curl -X POST 'http://localhost:9200/movies/_bulk' --data-binary @item_seed.json > /dev/null

This might fail for the 1M bulk file, as the single request can become too large for Elasticsearch to accept. Alternatively, use the rake task to bulk upload, which takes a bit longer:

$ bundle exec rake upload_bulk

This uploads all entries from the generated seed file to the Elasticsearch index named movies.
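
As a rough idea of what such a chunked upload amounts to (a hedged sketch, not the gem's actual implementation; the file name, batch size and endpoint are assumptions): read the seed file in batches of action/source line pairs and POST each batch to the Bulk API separately, so that no single request grows too large.

require 'net/http'
require 'uri'

uri = URI.parse('http://localhost:9200/movies/_bulk')
batch_size = 1_000 # action/source line pairs per request (assumed value)

Net::HTTP.start(uri.host, uri.port) do |http|
  File.open('item_seed.json') do |file|
    # each bulk entry is two lines: the action line and the document line
    file.each_slice(2 * batch_size) do |lines|
      request = Net::HTTP::Post.new(uri.request_uri)
      request['Content-Type'] = 'application/json'
      request.body = lines.join # lines keep their trailing newlines
      response = http.request(request)
      warn "bulk request failed with status #{response.code}" unless response.code == '200'
    end
  end
end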