TODO: some more info
Structured data is stored in a MongoDB database at opented.org (database: opented).
Unstructured cached HTML pages are also in that database, in a collection called dumps (in future this data should probably go directly to S3!).
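If you want to query the database directly rather than exporting, something like the following works. This is a minimal sketch using pymongo; the credentials and field names are reused from the mongoexport command below:

    # Sketch: connect to the opented MongoDB and peek at the dumps collection.
    # Credentials and field names are taken from the mongoexport command below.
    from pymongo import MongoClient

    client = MongoClient("mongodb://iacc:gohack@opented.org/opented")
    db = client["opented"]

    print(db.dumps.count_documents({}))         # how many cached pages?
    print(db.dumps.find_one({}, {"zhtml": 0}))  # one record, minus the bulky HTML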
Get the data out of MongoDB:
mongoexport --host opented.org --db opented --username iacc --password gohack --collection dumps --csv --fields "zhtml,doc_id,timestamp" | head -n 5000 > cache/dumps.csv
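It is worth sanity-checking the export before moving on. A sketch, assuming the header row that mongoexport --csv emits:

    # Sketch: sanity-check cache/dumps.csv after the export.
    import csv

    csv.field_size_limit(10**8)        # zhtml values can be large
    with open("cache/dumps.csv") as f:
        reader = csv.reader(f)
        header = next(reader)          # mongoexport --csv writes a header row
        rows = sum(1 for _ in reader)

    print("columns:", header)          # expect: zhtml, doc_id, timestamp
    print("rows:", rows)               # capped by the head -n 5000 above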
Then use extract.py:
python scripts/extract.py
This will write a large number of files to the cache/dumps/ directory.
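A quick way to see what came out; this sketch assumes one subdirectory per document, which is the layout the scraper step below expects:

    # Sketch: inspect extract.py's output under cache/dumps/. Assumes one
    # subdirectory per document id, as the scraper's cache/dumps/{docid}/
    # layout below suggests; skips any loose files such as an index.
    import os

    base = "cache/dumps"
    entries = sorted(os.listdir(base))
    doc_dirs = [e for e in entries if os.path.isdir(os.path.join(base, e))]
    print(len(doc_dirs), "document directories")
    for doc_id in doc_dirs[:5]:
        print(doc_id, "->", os.listdir(os.path.join(base, doc_id)))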
Now push these to S3:
s3cmd sync --acl-public cache/dumps/ s3://files.opented.org/scraped/
You will find the index of files at: http://files.opented.org.s3.amazonaws.com/scraped/index.json
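Downstream code can use that index to discover documents. A sketch follows; the exact shape of index.json isn't documented here, so the parsing is an assumption:

    # Sketch: fetch the public index of scraped files. The structure of
    # index.json is an assumption -- inspect it and adjust as needed.
    import json
    import urllib.request

    url = "http://files.opented.org.s3.amazonaws.com/scraped/index.json"
    with urllib.request.urlopen(url) as resp:
        index = json.load(resp)

    print(type(index).__name__, len(index))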
Now it's time to scrape some content!
We've written a Node.js scraper. You will need to install its dependencies first:
npm install cheerio request
Then do:
node scripts/scraper.js
Data will be written to cache/dumps/{docid}/extracted.json
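From there you can pull everything back into Python for analysis. A sketch, assuming each scraped document directory now holds an extracted.json:

    # Sketch: load all scraper output for downstream processing.
    # Documents the scraper hasn't reached yet simply won't match.
    import glob
    import json

    extracted = {}
    for path in glob.glob("cache/dumps/*/extracted.json"):
        doc_id = path.split("/")[-2]   # the {docid} path component
        with open(path) as f:
            extracted[doc_id] = json.load(f)

    print("loaded", len(extracted), "documents")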