Helper tools for the infrastructure

Log download

The get_logs.py script downloads and collates the log fragments that make up the logs generated by the various loader pipelines.

$ ./get_logs.py --help
usage: get_logs.py [-h] [-g GROUP_NAME] [-s STREAM_NAME] [-o OUTPUT_FILE]

Get a whole log stream with all the fragments from AWS.

optional arguments:
  -h, --help            show this help message and exit
  -g GROUP_NAME, --group-name GROUP_NAME
                        The name of the log group to use.
  -s STREAM_NAME, --stream-name STREAM_NAME
                        The log stream name to get. By default the latest stream is queried and downloaded.
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        File to save the log results to.
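For reference, fetching a whole stream from CloudWatch Logs follows a standard pattern: find the most recently active stream in the group, then page through it with get_log_events until the forward token stops changing. A minimal boto3 sketch of that pattern (the function and variable names here are illustrative, not the script's actual internals):

import boto3

logs = boto3.client("logs")

def download_latest_stream(group_name, output_file):
    # Illustrative helper: find the most recently active stream in the group.
    streams = logs.describe_log_streams(
        logGroupName=group_name,
        orderBy="LastEventTime",
        descending=True,
        limit=1,
    )["logStreams"]
    stream_name = streams[0]["logStreamName"]

    # Page through the stream from the start; get_log_events signals the
    # end by returning the same nextForwardToken twice in a row.
    kwargs = dict(logGroupName=group_name, logStreamName=stream_name,
                  startFromHead=True)
    token = None
    with open(output_file, "w") as out:
        while True:
            if token is not None:
                kwargs["nextToken"] = token
            response = logs.get_log_events(**kwargs)
            for event in response["events"]:
                out.write(event["message"] + "\n")
            if response["nextForwardToken"] == token:
                break
            token = response["nextForwardToken"]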

Batch deletion from a versioned bucket

It's occasionally necessary to delete multiple files from the warehouse buckets. Those buckets are versioned, so deleting things from them can take quite a bit of effort.

The batchdelete.py tool can query the versions of a given list of files, or of everything under a given prefix, and can then delete all file versions for that list or prefix.

$ ./batchdelete.py --help
usage: batchdelete.py [-h] --bucket BUCKET [--infile INFILE] [--prefix PREFIX] [--versionsfile VERSIONSFILE] [--delete] [--workers WORKERS]

Delete files from S3

optional arguments:
  -h, --help            show this help message and exit
  --bucket BUCKET       The bucket to query/delete from.
  --infile INFILE       The file containing the list of objects/prefixes to query or delete.
  --prefix PREFIX       The prefix to list all files in and optionally delete.
  --versionsfile VERSIONSFILE
                        A file with key,versionid listing to delete, generated by the querying of this script.
  --delete              Actually try to delete after querying
  --workers WORKERS     Number of parallel workers when getting versions from an 'infile'
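For reference, deleting from a versioned bucket means enumerating every version (and delete marker) of every key and removing each key/versionid pair, which is what makes this slow enough to need parallel workers. A minimal boto3 sketch of that pattern for a prefix-based deletion (the names here are illustrative, not the script's actual internals):

import boto3

s3 = boto3.client("s3")

def delete_all_versions(bucket, prefix):
    # Collect every version and delete marker under the prefix; both kinds
    # must be removed before a key fully disappears from a versioned bucket.
    paginator = s3.get_paginator("list_object_versions")
    to_delete = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for item in page.get("Versions", []) + page.get("DeleteMarkers", []):
            to_delete.append({"Key": item["Key"], "VersionId": item["VersionId"]})

    # delete_objects accepts at most 1000 entries per call, so batch them.
    for i in range(0, len(to_delete), 1000):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": to_delete[i:i + 1000], "Quiet": True},
        )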

Inventory files downloader

The S3 inventory files are a series of gzip-compressed CSV files hosted in a specific location in the warehouse infrastructure. They are generated automatically by AWS on a regular cadence (once a day).

Each CSV file lists a batch of object keys (up to 3,000,000 per file), and the whole set of inventory files adds up to a full inventory.

To get the latest set of inventory files, use the get_inventory.py script:

$ ./get_inventory.py --help
usage: get_inventory.py [-h] [-b BUCKET] [-o OUTPUT_FOLDER]

Download the latest set of S3 inventory files.

optional arguments:
  -h, --help            show this help message and exit
  -b BUCKET, --bucket BUCKET
                        The bucket whose inventory to grab.
  -o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER
                        Where to download the inventory files

and when run with the default settings:

$ ./get_inventory.py
INFO:botocore.credentials:Found credentials in environment variables.
INFO:root:Downloading inventory file: nccid-data-warehouse-prod/daily-full-inventory/data/e082ecb7-b3b5-457a-83c1-c53abfa08b45.csv.gz
INFO:root:Saved to: e082ecb7-b3b5-457a-83c1-c53abfa08b45.csv.gz
INFO:root:Downloading inventory file: nccid-data-warehouse-prod/daily-full-inventory/data/628c1dcb-681b-43e2-b190-720f0e8de880.csv.gz
INFO:root:Saved to: 628c1dcb-681b-43e2-b190-720f0e8de880.csv.gz
...
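For reference, each S3 inventory run writes a timestamped folder containing a manifest.json whose "files" array points at the gzip-compressed CSV parts; downloading the latest set means picking the newest manifest and fetching everything it references. A minimal boto3 sketch of that logic, assuming the standard inventory layout (the prefix and function names here are illustrative, not the script's actual internals):

import json
import os
import boto3

s3 = boto3.client("s3")

def download_latest_inventory(bucket, inventory_prefix, output_folder):
    # Each inventory run lives in a timestamped folder, so the
    # lexicographically largest manifest key is the most recent run.
    paginator = s3.get_paginator("list_objects_v2")
    manifests = []
    for page in paginator.paginate(Bucket=bucket, Prefix=inventory_prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/manifest.json"):
                manifests.append(obj["Key"])
    latest = sorted(manifests)[-1]

    # The manifest's "files" entries hold the keys of the CSV parts.
    manifest = json.loads(s3.get_object(Bucket=bucket, Key=latest)["Body"].read())
    for entry in manifest["files"]:
        target = os.path.join(output_folder, os.path.basename(entry["key"]))
        s3.download_file(bucket, entry["key"], target)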