TODO: some more info
Structured data is in a MongoDB at
Unstructured cached HTML pages are also in that DB, in a collection called dumps (in future this data should probably go directly to S3).
Export the data from MongoDB:
mongoexport --host --db opented --username iacc --password gohack --collection dumps --csv --fields "zhtml,doc_id,timestamp" | head -n 5000 > cache/dumps.csv
Then use
python scripts/
This will produce a large number of files under cache/dumps.
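For orientation, the split step can be sketched roughly as follows. This is not the project's script, only a minimal illustration under stated assumptions: the function name split_dumps and the output file name page.html are hypothetical, the CSV is assumed to start with the header row that mongoexport writes (zhtml, doc_id, timestamp), and zhtml is assumed to hold the page as plain text.

```python
import csv
from pathlib import Path

# Cached pages can be far larger than the csv module's default field limit.
csv.field_size_limit(2**31 - 1)


def split_dumps(csv_path, out_dir):
    """Write one directory per doc_id under out_dir; return the number of pages written.

    Assumptions (not confirmed by the project): the CSV has a header row with
    the columns zhtml and doc_id, and zhtml is plain-text HTML.
    """
    out_dir = Path(out_dir)
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            doc_id = (row.get("doc_id") or "").strip()
            if not doc_id:
                # Skip rows without an id; there is nowhere sensible to put them.
                continue
            doc_dir = out_dir / doc_id
            doc_dir.mkdir(parents=True, exist_ok=True)
            # "page.html" is a hypothetical file name chosen for this sketch.
            (doc_dir / "page.html").write_text(row["zhtml"], encoding="utf-8")
            written += 1
    return written
```

Calling split_dumps("cache/dumps.csv", "cache/dumps") would then give a cache/dumps/{doc_id}/ layout like the one the scraper writes into later. Note that the csv module reads quoted fields spanning several lines, but a record cut off mid-field by head -n 5000 may still come out incomplete.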
Now push these to s3:
s3cmd sync --acl-public cache/dumps/ s3://
You will find the index of files at:
Now it's time to scrape some content!
We've written a Node.js scraper. Install its dependencies first:
npm install cheerio requests
Then do:
node scripts/scraper.js
Data will be written to cache/dumps/{docid}/extracted.json
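To inspect the results, the extracted files can be gathered into a single mapping. The sketch below is an illustration only: the function name load_extracted is hypothetical, and it assumes nothing about the JSON beyond each extracted.json being valid JSON inside a directory named after its document id.

```python
import json
from pathlib import Path


def load_extracted(dumps_dir):
    """Return {doc_id: parsed JSON} for every <dumps_dir>/<doc_id>/extracted.json."""
    results = {}
    for path in sorted(Path(dumps_dir).glob("*/extracted.json")):
        with open(path, encoding="utf-8") as handle:
            # The directory name is taken to be the document id.
            results[path.parent.name] = json.load(handle)
    return results
```

For example, load_extracted("cache/dumps") returns a dict keyed by document id; directories where the scraper has not yet written extracted.json are simply absent from it.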