Dataset
DanteNiewenhuis edited this page Oct 29, 2017 · 3 revisions
Our dataset is taken from archive.org. The dataset consists of XML files, one for every category. The useful information is scraped from these files and written to separate JSON files. After all the files are parsed, the whole dataset is 9.4 GB.
From each document the following objects are taken and indexed separately:
- title
- question
- accepted answer
- the other answers
- all comments
- score
- viewcount
Parsing the files and pushing the results to the server took about 1.5 hours on a PC with a 4-core CPU and 16 GB of RAM.
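
As an illustration of the XML-to-JSON step, here is a minimal Python sketch. It is not the project's actual parser. The XML layout (one `<row>` element per post with `Title`, `Body`, `Score` and `ViewCount` attributes), the sample data, and the function name `iter_documents` are all assumptions made for the example. Answers and comments are left empty, because how they are linked to a question depends on the real file structure.

```python
import json
import xml.etree.ElementTree as ET
from io import BytesIO

# Hypothetical input: the real files may use different tags and attribute names.
SAMPLE_XML = b"""<posts>
  <row Id="1" Title="How do I parse XML?" Body="Question text" Score="5" ViewCount="120" />
  <row Id="2" Title="Second question" Body="More text" Score="2" ViewCount="40" />
</posts>"""


def iter_documents(source):
    """Stream <row> elements and yield one dict per document."""
    # iterparse reads the file incrementally, so a large file
    # does not have to be loaded into memory as a whole.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        yield {
            "title": elem.get("Title", ""),
            "question": elem.get("Body", ""),
            "accepted_answer": None,  # placeholder: needs a join with answer rows
            "answers": [],            # placeholder: the other answers
            "comments": [],           # placeholder: all comments
            "score": int(elem.get("Score", 0)),
            "viewcount": int(elem.get("ViewCount", 0)),
        }
        elem.clear()  # drop the element's attributes and children once it is processed


docs = list(iter_documents(BytesIO(SAMPLE_XML)))

# One JSON object per document, ready to be written to a file or sent to a server.
json_lines = [json.dumps(doc) for doc in docs]
```

Each dict keeps the fields under separate keys, which matches the idea of indexing them separately. To process a real file, pass an open file object or a path to `iter_documents` instead of the in-memory sample.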