Vaccination update automation

Vaccination data is updated on a daily basis. For some countries, the update is done by means of an automated process, while others require some manual work. To keep track of the currently automated processes, check this table.

Content

  1. Directory content
  2. Development environment
  3. The data pipeline
  4. Other functions
  5. Contribute
  6. FAQs

1. Directory content

This directory contains the following files:

| File name | Description |
| --- | --- |
| output/ | Temporary automated imports are placed here. |
| src/vax/ | Scripts to automate country data imports. |
| config.yaml | Data pipeline configuration. |
| us_states/input/ | Data for US-state vaccination data updates. |
| MANIFEST.in, setup.py, requirements.txt, requirements-flake.txt | Library development related files. |
| automation_state.csv | Lists if country process is automated (TRUE) or not (FALSE). |
| source_table.html | HTML table with country source URLs. Shown at OWID's website. |
| vax_update.sh.template | Template to push vaccination update changes. |

*Only the most relevant files are listed.

2. Development environment

Follow the steps below to correctly set up your virtual environment.

Python version

Make sure you have a working environment with Python 3 installed. We use Python >= 3.7.

You can check this with:

python --version
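
The same check can be done from inside Python, which is handy at the top of a setup script. A minimal sketch; the helper name `meets_minimum` is hypothetical and not part of the library:

```python
import sys

MINIMUM = (3, 7)  # lower bound stated in this README


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the given interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not meets_minimum():
        raise SystemExit(f"Python >= {MINIMUM[0]}.{MINIMUM[1]} is required")
    print("Python version OK")
```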

Install library

In your environment (shell), install the library in development mode. That is, run:

$ pip install -e .

In addition to the owid-covid19-vaccination-dev package, this installs the command-line tool cowid-vax, which is required to run the data pipeline.

Configuration file

To correctly run the data pipeline, make sure to have a valid configuration file. We currently use config.yaml. This file contains data used throughout the different pipeline stages.

global:
  project_dir: !ENV ${OWID_COVID_PROJECT_DIR}
  credentials: !ENV ${OWID_COVID_VAX_CREDENTIALS_FILE}
pipeline:
  get-data:
    parallel: True
    countries:
    njobs: -2
    skip_countries:
      - Colombia
  process-data:
    skip_complete:
    skip_monotonic_check:
      Northern Ireland:
        - date: 2021-04-29
          metrics: people_vaccinated
    skip_anomaly_check:
      Bahrain:
        - date: 2021-03-06
          metrics: total_vaccinations
      Bolivia:
        - date: 2021-03-06
          metrics: people_vaccinated
      Brazil:
        - date: 2021-01-21
          metrics: 
           - total_vaccinations
           - people_vaccinated
  generate-dataset:
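
The `skip_monotonic_check` and `skip_anomaly_check` entries above map a country to a list of dates and metrics, where `metrics` may be a single name or a list of names. How the pipeline consumes them is not shown here. The following is only an illustrative sketch, with a hypothetical helper `should_skip`, of how such an entry could be matched once the YAML has been loaded into a dictionary (dates are assumed to be parsed as `datetime.date`):

```python
from datetime import date

# Subset of the skip_anomaly_check section above, as it might look after loading.
skip_anomaly_check = {
    "Bahrain": [{"date": date(2021, 3, 6), "metrics": "total_vaccinations"}],
    "Brazil": [
        {"date": date(2021, 1, 21),
         "metrics": ["total_vaccinations", "people_vaccinated"]},
    ],
}


def should_skip(skip_cfg, country, day, metric):
    """Return True if (country, day, metric) is listed in the skip section."""
    for entry in (skip_cfg or {}).get(country, []):
        metrics = entry["metrics"]
        # `metrics` may be a single string or a list of strings.
        if isinstance(metrics, str):
            metrics = [metrics]
        if entry["date"] == day and metric in metrics:
            return True
    return False


print(should_skip(skip_anomaly_check, "Brazil", date(2021, 1, 21), "people_vaccinated"))
print(should_skip(skip_anomaly_check, "Bahrain", date(2021, 3, 6), "people_vaccinated"))
```

The first call matches the Brazil entry, while the second does not match because Bahrain only lists `total_vaccinations` for that date.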

Our current configuration requires the environment variables ${OWID_COVID_PROJECT_DIR} and ${OWID_COVID_VAX_CREDENTIALS_FILE} to be set beforehand. We recommend defining them in ~/.bashrc or ~/.bash_profile. For instance:

export OWID_COVID_PROJECT_DIR=/Users/username/projects/covid-19-data
export OWID_COVID_VAX_CREDENTIALS_FILE=${OWID_COVID_PROJECT_DIR}/scripts/scripts/vaccinations/vax_dataset_config.json
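
Since the configuration refers to these variables through `!ENV ${...}` placeholders, an unset variable is an easy source of errors. The sketch below uses only the standard library to check that both variables are defined and to show how a `${NAME}` placeholder could be expanded. The functions `missing_variables` and `expand_placeholders` are hypothetical and are not the loader used by the pipeline:

```python
import os
import re

REQUIRED = ("OWID_COVID_PROJECT_DIR", "OWID_COVID_VAX_CREDENTIALS_FILE")
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_variables(environ=os.environ, required=REQUIRED):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def expand_placeholders(value, environ=os.environ):
    """Replace each ${NAME} in `value`; raise KeyError if NAME is not set."""
    return PLACEHOLDER.sub(lambda m: environ[m.group(1)], value)


if __name__ == "__main__":
    print("Missing variables:", missing_variables())
```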

Credentials file

The environment variable ${OWID_COVID_VAX_CREDENTIALS_FILE} holds the path to the credentials file. This file is internal. Google-related fields require a valid OAuth JSON credentials file (see gsheets documentation). The file should have the following structure:

{
    "greece_api_token": "[GREECE_API_TOKEN]",
    "owid_cloud_table_post": "[OWID_CLOUD_TABLE_POST]",
    "google_credentials": "[CREDENTIALS_JSON_PATH]",
    "google_spreadsheet_vax_id": "[SHEET_ID]",
    "twitter_consumer_key": "[TWITTER_CONSUMER_KEY]",
    "twitter_consumer_secret": "[TWITTER_CONSUMER_SECRET]"
}
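
Before running the pipeline, it can help to confirm that the credentials file parses as JSON and contains the fields listed above. A minimal sketch; `missing_credential_keys` and `check_credentials_file` are hypothetical helpers, and the pipeline itself may validate the file differently:

```python
import json

EXPECTED_KEYS = (
    "greece_api_token",
    "owid_cloud_table_post",
    "google_credentials",
    "google_spreadsheet_vax_id",
    "twitter_consumer_key",
    "twitter_consumer_secret",
)


def missing_credential_keys(credentials, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from the parsed credentials."""
    return [key for key in expected if key not in credentials]


def check_credentials_file(path):
    """Load the JSON file at `path` and return its missing keys."""
    with open(path, encoding="utf-8") as f:
        return missing_credential_keys(json.load(f))
```

For example, `check_credentials_file(os.environ["OWID_COVID_VAX_CREDENTIALS_FILE"])` (after `import os`) returns an empty list when all fields are present.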

Check the style

We use flake8 to check the style of our code. The configuration lives in the file tox.ini. To check the style, run:

$ tox

Note: This requires tox to be installed ($ pip install tox).

3. The data pipeline

Before running the pipeline to update the data, make sure the development environment is set up correctly.

Manual data updates

Check for new updates and manually add them in the internal spreadsheet:

  • See this repo's pull requests and issues.
  • Look for new data based on previously-used source URLs.

Automated process

Once all manual processes have been finished, use the cowid-vax tool. The automated process is broken into four sub-steps, which we explain below. While these can all be run at once, we recommend running them one by one. Before running them, make sure your configuration file is being picked up correctly.

Note: you can use vax_update.sh.template as an example of how to run the automated step of the data pipeline.

Data pipeline configuration

To use the configuration in your config.yaml, you can do one of the following:

  • Set the environment variable ${OWID_COVID_VAX_CONFIG_FILE} to the file's path.
  • Save the configuration under ~/.config/cowid/config.yaml and run the tool.
  • Run $ cowid-vax --config config.yaml, explicitly specifying the path to the config file.

If none of the above is possible, pass the arguments directly on the command line, i.e. --parallel, --countries, etc.

For more details, run cowid-vax --help:
usage: cowid-vax [-h] [-c COUNTRIES] [-p] [-j NJOBS] [-s] [--config CONFIG] [--credentials CREDENTIALS] [--checkr]
                 {get-data,process-data,generate-dataset,export,all}

Execute COVID-19 vaccination data collection pipeline.

positional arguments:
  {get-data,process-data,generate-dataset,export,all}
                        Choose a step: i) `get-data` will run automated scripts, 2) `process-data` will get csvs generated in
                        1 and collect all data from spreadsheet, 3) `generate-dataset` generate the output files, 4) `export`
                        to generate all final files, 5) `all` will run all steps sequentially.

optional arguments:
  -h, --help            show this help message and exit
  -c COUNTRIES, --countries COUNTRIES
                        Run for a specific country. For a list of countries use commas to separate them (only in mode get-
                        data)E.g.: peru, norway. Special keywords: 'all' to run all countries, 'incremental' to run
                        incrementalupdates, 'batch' to run batch updates. Defaults to all countries. (default: all)
  -p, --parallel        Execution done in parallel (only in mode get-data). (default: False)
  -j NJOBS, --njobs NJOBS
                        Number of jobs for parallel processing. Check Parallel class in joblib library for more info (only in
                        mode get-data). (default: -2)
  -s, --show-config     Display configuration parameters at the beginning of the execution. (default: False)
  --config CONFIG       Path to config file (YAML). Will look for file in path given by environment variable
                        `$OWID_COVID_VAX_CONFIG_FILE`. If not set, will default to ~/.config/cowid/config.yaml (default:
                        /Users/lucasrodes/repos/covid-19-data/scripts/scripts/vaccinations/config.yaml)
  --credentials CREDENTIALS
                        Path to credentials file (JSON). If a config file is being used, the value ther will be prioritized.
                        (default: vax_dataset_config.json)
  --checkr              Compare results from generate-dataset with results obtained with former generate_dataset.R script.It
                        requires that the R script is previously run (without removing temporary files vax & metadata)!
                        (default: False)
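
The lookup described above and in the `--config` help text can be summarised as follows. This is a sketch of the documented order, not the tool's actual code: `resolve_config_path` is a hypothetical name, and giving an explicit `--config` value precedence over the environment variable is an assumption.

```python
import os
from pathlib import Path

DEFAULT_CONFIG = Path("~/.config/cowid/config.yaml")


def resolve_config_path(cli_value=None, environ=os.environ):
    """Pick the config path: explicit value, then env variable, then default."""
    if cli_value:  # assumed to win when given explicitly
        return Path(cli_value).expanduser()
    env_value = environ.get("OWID_COVID_VAX_CONFIG_FILE")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_CONFIG.expanduser()


print(resolve_config_path("config.yaml", environ={}))
```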

Get the data

Run:

$ cowid-vax get

This step runs the scripts for batch and incremental updates. It then generates individual country files and saves them in output/.

Note: This step might crash for some countries, as an automation script may stop working, permanently or temporarily (e.g. due to changes in the source). Try to keep the scripts up to date.
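
Because individual country scripts can fail, it is useful to run each one in isolation so that a single failure does not stop the rest. The sketch below illustrates the idea with the standard library's `concurrent.futures`. According to the help text above, the tool itself relies on joblib for parallelism, so this is a swapped-in technique, and the scraper callables are hypothetical stand-ins for the country scripts:

```python
from concurrent.futures import ThreadPoolExecutor


def run_scrapers(scrapers, max_workers=4):
    """Run each country callable; return (list of successes, dict of failures)."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(func) for name, func in scrapers.items()}
        for name, future in futures.items():
            try:
                future.result()
                succeeded.append(name)
            except Exception as err:  # keep going; report the failure at the end
                failed[name] = repr(err)
    return succeeded, failed


def broken_source():
    raise ValueError("source page changed")


# Hypothetical stand-ins for country scripts.
ok, errors = run_scrapers({"peru": lambda: None, "norway": broken_source})
print("succeeded:", ok)
print("failed:", errors)
```

Collecting the failures instead of raising immediately gives a single list of countries whose scripts need attention after the run.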

Process the data

Run:

$ cowid-vax process

This step collects the manually updated data from the spreadsheet and the data generated in the previous step. It processes this data and generates public country data in country_data/, as well as the temporary files vaccinations.preliminary.csv and metadata.preliminary.csv.

Generate the dataset

Run:

$ cowid-vax generate

This step generates the pipeline output files.

Export final files and update website

Run:

$ cowid-vax export

This is the final pipeline step. It updates a few more output files. It also opens OWID's vaccination website so that the table references can be updated (the HTML is automatically copied to the clipboard).

Generated files

Once the automation is successfully executed, the following files and directories are updated:

| File name | Description |
| --- | --- |
| vaccinations.csv | Main output with vaccination data of all countries. |
| vaccinations.json | Same as vaccinations.csv but in JSON format. |
| vaccinations-by-manufacturer.csv | Secondary output with vaccination by manufacturer for a select number of countries. |
| country_data/ | Individual country CSV files. |
| locations.csv | Country-level metadata. |
| source_table.html | HTML table with country source URLs. Shown at OWID's website. |
| automation_state.csv | Lists if country process is automated (TRUE) or not (FALSE). |
| COVID-19 - Vaccinations.csv | Internal file for OWID grapher on vaccinations. |
| COVID-19 - Vaccinations by manufacturer.csv | Internal file for OWID grapher on vaccinations by manufacturer. |

You can find more information about these files here.

Notes

You can run several steps at once, e.g.

$ cowid-vax get process
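
If you prefer to follow the recommendation of running the steps one by one, a small wrapper can stop at the first failing step. A sketch, assuming `cowid-vax` is on your PATH and accepts the step names exactly as they are written in the commands of this section; the function names are hypothetical:

```python
import subprocess

STEPS = ("get", "process", "generate", "export")  # as written in this section


def build_commands(steps=STEPS, config=None):
    """Return one argv list per step, optionally passing --config."""
    extra = ["--config", config] if config else []
    return [["cowid-vax", step, *extra] for step in steps]


def run_pipeline(steps=STEPS, config=None):
    """Run each step in order; check=True raises on the first failure."""
    for argv in build_commands(steps, config):
        print("Running:", " ".join(argv))
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_pipeline() to execute them.
    for argv in build_commands(config="config.yaml"):
        print(" ".join(argv))
```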

4. Other functions

Tracking

It is extremely useful to get some insight into which data we are tracking (and which we are not). This can be done with the tool cowid-vax-track. Find some use cases below.

Note: Use the option --to-csv to export results as CSV files (a default filename is used).

Which countries are missing? Run:
$ cowid-vax-track countries-missing

Countries are given from most to least populated.

Which countries have been updated infrequently? Get the list of countries sorted by least frequently updated. The update frequency is defined as the ratio between the number of days with an update and the number of days of observation (i.e. days since the first update).
$ cowid-vax-track countries-least-updatedfreq

Countries are given from least to most frequently updated.
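
The update frequency defined above can be written out as a short calculation. This is an illustrative sketch, not the code behind `cowid-vax-track`: the function name is hypothetical, and counting the day of the first update as part of the observation period is an assumption.

```python
from datetime import date


def update_frequency(update_dates, as_of):
    """Days with an update divided by days observed since the first update."""
    days_with_update = set(update_dates)
    if not days_with_update:
        return 0.0
    first = min(days_with_update)
    days_observed = (as_of - first).days + 1  # assumption: first day counts
    return len(days_with_update) / days_observed


updates = [date(2021, 1, 1), date(2021, 1, 2), date(2021, 1, 5)]
print(update_frequency(updates, as_of=date(2021, 1, 10)))  # 3 updates / 10 days = 0.3
```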

Which countries haven't been updated for some time? Get the list of countries and their last update by running:
$ cowid-vax-track countries-last-updated

Countries are given from least to most recently updated.

Which countries have been updated only a few times? Get the list of countries with the fewest updates (in absolute counts):
$ cowid-vax-track countries-least-updated

Countries are given from the fewest to the most updates.

Which vaccines are missing? Get the list of countries with missing vaccines:
$ cowid-vax-track vaccines-missing

Countries are given from the one with the fewest to the one with the most untracked vaccines.

5. Contribute

We welcome contributions! Read more in CONTRIBUTE.

6. FAQs

Any question or suggestion?

Kindly open an issue. If you have a technical proposal, feel free to open a pull request.

An automation no longer works (internal)

If you detect that an automation is no longer working, and the process seems like it can't be fixed at the moment:

  • Set its state to automated = FALSE in the LOCATIONS tab of the internal spreadsheet.
  • Add a new tab in the spreadsheet to manually input the country data. Make sure to include the historical data from the output file.
  • Delete the automation script and automated CSV output to avoid confusion.