Kaggle submission of the team 'poweredByTalkwalker' for the Allen AI competition

This code reproduces our solution, which scored 0.5900 on the public leaderboard and 0.58344 on the private one as of Friday, 12 February 2016, 07:28:58 UTC. We worked on a quad-core i7 machine with 32 GB RAM and a 400 GB SSD running Ubuntu 14.04.3 LTS.

model execution

This part depends only on R (3.2.2) with the packages plyr (1.8.3), dplyr (0.4.3), reshape2 (1.4.1), caret (6.0-57) and xgboost (0.4-2).
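
The package versions above are pinned. As a minimal sketch (not part of the repository), they can be installed from the CRAN archive with remotes::install_version; treat the mirror URL as an assumption:

    # Install 'remotes', then pull the exact pinned versions from the CRAN archive
    Rscript -e 'install.packages("remotes", repos = "https://cloud.r-project.org")'
    Rscript -e 'for (p in list(c("plyr", "1.8.3"), c("dplyr", "0.4.3"),
                               c("reshape2", "1.4.1"), c("caret", "6.0-57"),
                               c("xgboost", "0.4-2")))
                  remotes::install_version(p[1], version = p[2],
                                           repos = "https://cloud.r-project.org")'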

To reproduce our submission, place the training_set.tsv, validation_set.tsv and test_set.tsv files into the R/input folder. Then execute

./scripts/unsplitRData.sh

followed by

cd R && mkdir output && R --no-save < createInputFile.R && R --no-save < runModel.R

This will produce the final submission in the folder R/output.
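
The exact name of the submission file is determined by the scripts; a quick, hypothetical sanity check (file name pattern assumed):

    # List the generated files and peek at the submission
    ls R/output/
    head -3 R/output/*.csv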

full model prediction pipeline

To explore the model prediction in detail, we also provide the code to build the IR indices from scratch. Be aware that this part downloads more than 25 GB of data, requires a running ElasticSearch installation with around 100 GB of disk space, and that the total runtime of all scripts is around 5 days.

As system requirements, this part additionally needs

  • Java 7 for most steps (Java 8 for the NVAO transformation)
  • ElasticSearch 1.7.3
  • Gradle 2.11
  • Python 2.7 with NLTK (3.1)

and the following R packages (an installation sketch follows the list)

  • data.table (1.9.6)
  • elastic (0.5.0)
  • elasticdsl (0.0.3.9500)
  • FeatureHashing (0.9.1.1)
  • hash (2.2.6)
  • httr (1.0.0)
  • jsonlite (0.9.17)
  • Matrix (1.2-2)
  • rJava (0.9-7)
  • RWeka (0.4-24)
  • stringr (1.0.0)
  • text2vec (0.2.1)
  • tm (0.6-2)

(Apache Tika, jsoup and the Stanford NLP library are downloaded automatically by the Gradle scripts.)
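
A hedged setup sketch: check the tool versions and install the R packages. The packages are installed unpinned here (use remotes::install_version as above to pin the listed versions); elasticdsl 0.0.3.9500 is a development version, and its GitHub source below is an assumption:

    # Verify the tools on $PATH
    java -version                                     # expect 1.7.x (1.8.x for NVAO)
    gradle --version                                  # expect 2.11
    python --version                                  # expect 2.7.x
    python -c "import nltk; print(nltk.__version__)"  # expect 3.1

    # Install the R packages from CRAN (unpinned)
    Rscript -e 'install.packages(c("data.table", "elastic", "FeatureHashing",
      "hash", "httr", "jsonlite", "Matrix", "rJava", "RWeka", "stringr",
      "text2vec", "tm"), repos = "https://cloud.r-project.org")'
    Rscript -e 'remotes::install_github("ropensci/elasticdsl")'  # source assumed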

data preparation

  • execute the download scripts from the scripts folder (see the consolidated sketch after this list)

  • download and install WikiExtractor and use it to transform the Wikipedia dumps, e.g.

    WikiExtractor.py -o tmp/simplewiki -b 100G -s -ns Article --no-templates tmp/simplewiki-20151020-pages-articles.xml.bz2

  • execute the transformation scripts as described in java/transformation

  • use quizlet.py to retrieve the quizlet data (see also the Readme in the python folder)

  • transform the quizlet data by executing

    R --no-save < R/utils/importQuizlet.R

  • execute the NVAO processing on the kaggle input files in R/input, as described in java/nvao
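
The per-folder Readmes carry the authoritative commands; the following sketch only illustrates the order of the steps above, with placeholder names where this Readme does not spell them out:

    # 1. download the raw corpora (actual script names live in scripts/)
    ./scripts/downloadData.sh              # placeholder name
    # 2. extract plain text from the Wikipedia dumps with WikiExtractor (see above)
    # 3. run the transformation scripts described in java/transformation
    # 4. fetch and import the quizlet data
    python quizlet.py                      # see the Readme in the python folder
    R --no-save < R/utils/importQuizlet.R
    # 5. run the NVAO processing described in java/nvao on the files in R/input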

ElasticSearch index creation

  • execute the lines in es/Readme.md
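
Before creating the indices, you can verify that the ElasticSearch node is reachable (default endpoint localhost:9200 assumed):

    # Should report "number" : "1.7.3" in the version block of the response
    curl -XGET 'http://localhost:9200/'
    # Once the indices are built, list them
    curl -XGET 'http://localhost:9200/_cat/indices?v'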

R pipeline

  • build the jar in java/stemming according to its Readme.md and copy the result from build/libs into R/lib (see the sketch after this list)

  • run the R script

    cd R && R --no-save < runFullModelPipeline.R
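
Assuming the default gradle build task produces the jar (java/stemming/Readme.md is authoritative), the first step amounts to:

    # Build the stemming jar and copy it next to the R scripts (task name assumed)
    cd java/stemming && gradle build && cd ../..
    cp java/stemming/build/libs/*.jar R/lib/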

training

To re-train the model, execute

cd R && R --no-save < trainModel.R
