A web scraper tool to get data for evaluating Wellcome impact.
To bring up the development environment using docker:
- Start a clean postgres DB:
docker-compose up -d
- Build the base image
make base_image
Then, to run a scraper from your local repo, run:
./docker_run.sh ./entrypoint.sh SPIDER_TO_RUN
where SPIDER_TO_RUN is one of:
- who_iris
- gov_uk
- nice
- unicef
- msf
- parliament
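If you prefer to launch a spider from a Python script, the command above can be wrapped as in the sketch below. This is only an illustration: `build_command` and `run_spider` are hypothetical helper names that are not part of this repository, and the sketch assumes it is run from the repository root with the development environment already up.

```python
import subprocess

# Spider names as listed in this README.
SPIDERS = ("who_iris", "gov_uk", "nice", "unicef", "msf", "parliament")


def build_command(spider):
    """Return the argv list that runs `spider` through docker_run.sh."""
    if spider not in SPIDERS:
        raise ValueError(f"unknown spider {spider!r}; expected one of {SPIDERS}")
    return ["./docker_run.sh", "./entrypoint.sh", spider]


def run_spider(spider):
    """Run one spider (hypothetical helper); raises if the command fails."""
    subprocess.run(build_command(spider), check=True)
```

Checking the name before calling the shell script means a typo fails immediately with a clear message, without starting a container first.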
If you need to run outside docker, Dockerfile.base and entrypoint.sh should point you in the right direction.
To run tests, first bring up the development environment as above.
Then, run:
./docker_run.sh python -m unittest discover -s /pwd/tests
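`unittest` discovery picks up files matching `test*.py` in the given directory. As a rough sketch of what such a module could look like, the example below tests a made-up helper. Both `clean_title` and the test class are hypothetical and do not exist in this repository.

```python
import unittest


def clean_title(raw):
    """Hypothetical helper: collapse runs of whitespace in a title."""
    return " ".join(raw.split())


class TestCleanTitle(unittest.TestCase):
    def test_collapses_whitespace(self):
        self.assertEqual(clean_title("  Annual \n report  "), "Annual report")

    def test_empty_string(self):
        self.assertEqual(clean_title(""), "")
```

Saved under the tests directory with a name such as `test_clean_title.py` (an assumed file name), a module like this would be found by the discovery command above.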
To deploy this scraper yourself, see the wiki: https://github.com/wellcometrust/wsf-web-scraper/wiki
This scraper can also be deployed more easily using Docker.
The output file is meant to contain a number of different fields, which can vary depending on the scraper provider.
It will always have the following attributes, though:
| Unique | Attribute | Description |
|---|---|---|
| | title | a string containing the document title |
| * | uri | the URL of the document |
| | | the name of the file |
| | sections | a JSON object of section names, containing the text extracted from matching sections |
| | keywords | a JSON object of keywords, containing the text extracted from matching text |
| * | hash | an MD5 digest of the file |
| | provider | the provider from which the file was downloaded |
| | date_scraped | the date (YYYYMMDD) when the article was scraped |
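The attributes above can be checked on a single record roughly as follows. This is a minimal sketch, not the scraper's own validation: it assumes a record has already been loaded into a Python `dict`, that `hash` is stored as a hexadecimal string, and that `date_scraped` is stored as text. `check_record`, `md5_digest` and `is_yyyymmdd` are hypothetical names.

```python
import hashlib
from datetime import datetime

# Attributes named in the table above. The file-name attribute is left out
# because its name is not given there.
REQUIRED = ("title", "uri", "sections", "keywords", "hash", "provider", "date_scraped")


def md5_digest(data):
    """Hex MD5 digest of raw file bytes (hex encoding is an assumption)."""
    return hashlib.md5(data).hexdigest()


def is_yyyymmdd(value):
    """True if value is exactly eight digits forming a real calendar date."""
    text = str(value)
    if len(text) != 8 or not text.isdigit():
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True


def check_record(record, file_bytes=None):
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing attribute: {name}" for name in REQUIRED if name not in record]
    if "date_scraped" in record and not is_yyyymmdd(record["date_scraped"]):
        problems.append("date_scraped is not YYYYMMDD")
    # Only compare hashes when the caller supplies the downloaded file.
    if file_bytes is not None and "hash" in record:
        if record["hash"] != md5_digest(file_bytes):
            problems.append("hash does not match the MD5 of the file")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to report everything wrong with a record in a single pass.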
Some providers will have additional parameters:
| Attribute | Description |
|---|---|
| year | the publication year of the document |
| types | an array containing the WHO type associated with the document |
| subjects | an array containing the WHO subjects of the document |
| authors | an array containing the authors (from WHO) |

| Attribute | Description |
|---|---|
| year | the publication year of the document |

| Attribute | Description |
|---|---|
| year | the publication year of the document |
| types | the type of the document |
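Because these extra attributes differ between providers, code that reads the output may want to treat them as optional. The sketch below is one possible approach, assuming records are Python `dict`s and that a provider describing `types` as "the type of the document" stores a single string where others store an array. `provider_extras` and `types_as_list` are hypothetical helper names.

```python
# Optional attributes named in the tables above.
OPTIONAL = ("year", "types", "subjects", "authors")


def provider_extras(record):
    """Return only the optional attributes that this record actually has."""
    return {name: record[name] for name in OPTIONAL if name in record}


def types_as_list(record):
    """Return `types` as a list, whether it is stored as an array or one value."""
    value = record.get("types")
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    # Assumption: a single type is stored as a scalar such as a string.
    return [value]
```

Normalising `types` this way lets downstream code iterate over it without checking which provider produced the record.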