wsf-web-scraper

A web scraper tool to get data for evaluating Wellcome impact.

Development

To bring up the development environment using docker:

Start a clean postgres DB:
```
docker-compose up -d
```
Build the base image:
```
make base_image
```

Then, to run a scraper from your local repo, run:

./docker_run.sh ./entrypoint.sh SPIDER_TO_RUN

where SPIDER_TO_RUN is one of:

who_iris
gov_uk
nice
unicef
msf
parliament
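
If you launch scrapers from a script, a small wrapper can reject a mistyped spider name before Docker starts. The following is only a sketch: `build_command` is a hypothetical helper that is not part of this repository, and it assumes it is called from the repository root, where `docker_run.sh` and `entrypoint.sh` live.

```python
# Spider names copied from the list above.
SPIDERS = ("who_iris", "gov_uk", "nice", "unicef", "msf", "parliament")


def build_command(spider):
    """Return the argv list for running one spider through docker_run.sh.

    Hypothetical helper: raises ValueError for a name that is not listed above.
    """
    if spider not in SPIDERS:
        raise ValueError(
            "unknown spider %r, expected one of: %s" % (spider, ", ".join(SPIDERS))
        )
    return ["./docker_run.sh", "./entrypoint.sh", spider]
```

The returned list can be passed to `subprocess.run(build_command("nice"), check=True)`.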

If you need to run outside docker, Dockerfile.base and entrypoint.sh should point you in the right direction.

Testing

To run tests, first bring up the development environment as above.

Then, run:

./docker_run.sh python -m unittest discover -s /pwd/tests

Usage

To deploy this scraper yourself, see the wiki: https://github.com/wellcometrust/wsf-web-scraper/wiki

This scraper can also be deployed more easily using Docker.

Output Formatting

The output file is meant to contain a number of different fields, which can vary depending on the provider being scraped.

It will always have the following attributes, though:

| Unique | Attribute | Description |
|--------|-----------|-------------|
| | title | a string containing the document title |
| * | uri | the URL of the document |
| | pdf | the name of the file |
| | sections | a JSON object of section names, containing the text extracted from matching sections |
| | keywords | a JSON object of keywords, containing the text extracted from matching text |
| * | hash | an MD5 digest of the file |
| | provider | the provider from which the file was downloaded |
| | date_scraped | the date (YYYYMMDD) on which the article was scraped |
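
To make the common attributes concrete, here is a minimal sketch of how one record could be assembled and checked in Python. `make_record` and `missing_fields` are hypothetical names, not functions from this repository, and the record is modelled as a plain dict; the scraper's actual serialisation is not described here.

```python
import hashlib
from datetime import date

# Attributes every record is expected to carry, per the table above.
REQUIRED_FIELDS = (
    "title", "uri", "pdf", "sections", "keywords",
    "hash", "provider", "date_scraped",
)


def make_record(title, uri, pdf_name, pdf_bytes, provider,
                sections=None, keywords=None, scraped_on=None):
    """Build a dict with the common attributes (illustrative only)."""
    scraped_on = scraped_on or date.today()
    return {
        "title": title,
        "uri": uri,
        "pdf": pdf_name,
        "sections": sections or {},   # section name -> extracted text
        "keywords": keywords or {},   # keyword -> extracted text
        "hash": hashlib.md5(pdf_bytes).hexdigest(),  # MD5 digest of the file
        "provider": provider,
        "date_scraped": scraped_on.strftime("%Y%m%d"),
    }


def missing_fields(record):
    """Return the required attribute names absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]
```

Calling `missing_fields` on a scraped record returns an empty list when all common attributes are present.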

Some providers will have additional parameters:

WHO

| Attribute | Description |
|-----------|-------------|
| year | the publication year of the document |
| types | an array containing the WHO type associated with the document |
| subjects | an array containing the WHO subjects of the document |
| authors | an array containing the authors (from WHO) |

Nice

| Attribute | Description |
|-----------|-------------|
| year | the publication year of the document |

Parliament

| Attribute | Description |
|-----------|-------------|
| year | the publication year of the document |
| types | the type of the document |
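
The provider-specific attributes can be checked in the same spirit. This sketch assumes that the `provider` value of a record matches the spider name (`who_iris`, `nice`, `parliament`), which this README does not confirm; `missing_extra_fields` is a hypothetical helper.

```python
# Additional attributes per provider, taken from the tables above.
# Keys are assumed to match the spider names; adjust them if the
# provider field uses different values.
EXTRA_FIELDS = {
    "who_iris": ("year", "types", "subjects", "authors"),
    "nice": ("year",),
    "parliament": ("year", "types"),
}


def missing_extra_fields(record):
    """Return the provider-specific attributes absent from a record.

    Providers without additional attributes yield an empty list.
    """
    expected = EXTRA_FIELDS.get(record.get("provider"), ())
    return [name for name in expected if name not in record]
```

For example, a `parliament` record that carries `year` but not `types` would be reported as missing `types`.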
