Kaggle submission of the team 'poweredByTalkwalker' for the Allen AI competition
This code reproduces our solution, which scored 0.5900 on the public leaderboard and 0.58344 on the private one (submission from Friday, 12 February 2016, 07:28:58 UTC). We worked on an i7 quad-core machine with 32 GB RAM and a 400 GB SSD running Ubuntu 14.04.3 LTS.
This part depends only on R (3.2.2) with the packages plyr (1.8.3), dplyr (0.4.3), reshape2 (1.4.1), caret (6.0-57) and xgboost (0.4-2).
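If these exact versions are not installed yet, one way to pin them is via devtools::install_version; this is not part of the pipeline, just a sketch assuming devtools is available and the versions are still in CRAN's archive:

    # Sketch: install the pinned package versions from the CRAN archive
    library(devtools)
    pkgs <- c(plyr = "1.8.3", dplyr = "0.4.3", reshape2 = "1.4.1",
              caret = "6.0-57", xgboost = "0.4-2")
    for (p in names(pkgs)) {
      install_version(p, version = pkgs[[p]], repos = "https://cran.r-project.org")
    }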
In order to reproduce our submission, place the training_set.tsv, validation_set.tsv and test_set.tsv files into the R/input folder. Then execute
./scripts/unsplitRData.sh
followed by
cd R && mkdir output && R --no-save < createInputFile.R && R --no-save < runModel.R
This will produce the final submission in the folder R/output.
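As a quick sanity check of that file you can inspect it in R; the file name below is hypothetical (use whatever runModel.R writes to R/output), and we assume the competition's standard two-column id/correctAnswer format:

    # Sanity check of the generated submission; adjust the file name to the
    # actual output of runModel.R
    sub <- read.csv("R/output/submission.csv", stringsAsFactors = FALSE)
    stopifnot(identical(names(sub), c("id", "correctAnswer")),
              all(sub$correctAnswer %in% c("A", "B", "C", "D")))
    nrow(sub)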
In order to explore the model predictions in detail, we also provide the code to build the IR indices from scratch (a minimal query sketch illustrating the idea is given at the end of this section). Be aware that this part downloads more than 25 GB of data, requires a running Elasticsearch installation with around 100 GB of disk space, and takes around 5 days of total runtime across all scripts.
In addition, this part requires
- Java 7 for most steps (Java 8 for the NVAO transformation)
- Elasticsearch 1.7.3 (see the connectivity check after the package lists below)
- Gradle 2.11
- Python 2.7 with NLTK (3.1)
and R packages
- data.table(1.9.6)
- elastic(0.5.0)
- elasticdsl(0.0.3.9500)
- FeatureHashing(0.9.1.1)
- hash(2.2.6)
- httr(1.0.0)
- jsonlite(0.9.17)
- Matrix(1.2-2)
- rJava(0.9-7)
- RWeka(0.4-24)
- stringr(1.0.0)
- text2vec(0.2.1)
- tm(0.6-2)
(Apache Tika, jsoup and the Stanford NLP library will be downloaded automatically by the Gradle scripts)
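Before running the index-building scripts it can help to verify that the local Elasticsearch node is reachable; a minimal check using the elastic package (the exact connect() arguments may differ between package versions) is:

    # Sketch: check that the local Elasticsearch 1.7.3 node answers on port 9200
    library(elastic)
    connect(es_host = "localhost", es_port = 9200)
    ping()   # should report the cluster name and version 1.7.3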
- execute the download scripts from the scripts folder
- download and install WikiExtractor and transform the dumps downloaded from Wikipedia, e.g.
  WikiExtractor.py -o tmp/simplewiki -b 100G -s -ns Article --no-templates tmp/simplewiki-20151020-pages-articles.xml.bz2
- execute the transformation scripts as described in java/transformation
- use quizlet.py to retrieve the Quizlet data (see also the Readme in the python folder)
- transform the Quizlet data by executing R --no-save < R/utils/importQuizlet.R
- execute the NVAO processing as described in java/nvao on the Kaggle input files in R/input
- execute the lines in es/Readme.md
- build the jar in java/stemming according to its Readme.md and copy the result from build/libs into R/lib
- run the R script
  cd R && R --no-save < runFullModelPipeline.R

In order to re-train the model, execute
cd R && R --no-save < trainModel.R
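To illustrate the IR idea behind the features, here is a minimal sketch (not the team's actual feature code) that queries one of the built indices with question plus answer text and keeps the best hit score per answer; the index name "simplewiki" and the use of the elastic package as above are assumptions:

    # Sketch: score each answer by querying the index with question + answer text;
    # the answer whose query scores highest against the index wins
    library(elastic)
    connect(es_host = "localhost", es_port = 9200)

    score_answers <- function(question, answers, index = "simplewiki") {
      sapply(answers, function(a) {
        res <- Search(index = index, q = paste(question, a), size = 1)
        if (length(res$hits$hits) == 0) 0 else res$hits$max_score
      })
    }

    scores <- score_answers("Which force pulls objects toward the ground?",
                            c(A = "gravity", B = "magnetism", C = "friction", D = "inertia"))
    names(which.max(scores))   # expected: "A"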