Dataset

Our dataset is taken from archive.org. It consists of XML files, one for every category. The useful information is scraped from these files and put into separate JSON files. After all the files are parsed, the whole dataset is 9.4 GB.

From each document, the following objects are taken and indexed separately:

title
question
accepted answer
the other answers
all comments
score
viewcount
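
The extraction step could look roughly like the sketch below. It is not the project's actual parser: it assumes each record is stored as a `row` element whose data sits in XML attributes, and the attribute names (`Title`, `Body`, `Score`, `ViewCount`) and the helper names `extract_records` and `write_json_lines` are hypothetical. Answers and comments are left out for brevity. The file is streamed so that large XML files do not have to fit in memory.

```python
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML attribute names to the indexed field names.
# The real schema of the archive.org files may differ.
FIELDS = {
    "Title": "title",
    "Body": "question",
    "Score": "score",
    "ViewCount": "viewcount",
}
NUMERIC_FIELDS = {"score", "viewcount"}


def extract_records(source, row_tag="row"):
    """Yield one dict per matching element, reading the XML as a stream.

    `source` is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != row_tag:
            continue
        record = {}
        for attr_name, field_name in FIELDS.items():
            if attr_name in elem.attrib:
                value = elem.attrib[attr_name]
                if field_name in NUMERIC_FIELDS:
                    value = int(value)
                record[field_name] = value
        yield record
        elem.clear()  # drop the element's data once it has been read


def write_json_lines(records, out):
    """Write one JSON object per line to the text file object `out`."""
    for record in records:
        out.write(json.dumps(record) + "\n")


# Small in-memory sample standing in for one category file (made-up data).
SAMPLE = (
    b'<posts>'
    b'<row Id="1" Title="How to parse XML?" Body="Example body" '
    b'Score="5" ViewCount="120" />'
    b'<row Id="2" Body="An answer without a title" Score="3" />'
    b'</posts>'
)
```

For a file on disk, `extract_records("some_category.xml")` can be passed straight to `write_json_lines` together with an open output file; the file name here is only a placeholder.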

Parsing the files and pushing the results to the server took about 1.5 hours on a PC with a 4-core CPU and 16 GB of RAM.
