Helper tools for the infrastructure

Log download

The get_logs.py script downloads and collates the log fragments that make up the logs generated by the various loader pipelines.

$ ./get_logs.py --help
usage: get_logs.py [-h] [-g GROUP_NAME] [-s STREAM_NAME] [-o OUTPUT_FILE]

Get a whole log stream with all the fragments from AWS.

optional arguments:
  -h, --help            show this help message and exit
  -g GROUP_NAME, --group-name GROUP_NAME
                        The name of the log group to use.
  -s STREAM_NAME, --stream-name STREAM_NAME
                        The log stream name to get. By default the latest stream is queried and downloaded.
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        File to save the log results to.
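For reference, fetching a whole stream from CloudWatch Logs follows a standard pattern: find the most recently active stream in the group, then page through it with get_log_events until the forward token stops changing. A minimal boto3 sketch of that pattern (the function and variable names here are illustrative, not the script's actual internals):

import boto3

logs = boto3.client("logs")

def download_latest_stream(group_name, output_file):
    # Illustrative helper: find the most recently active stream in the group.
    streams = logs.describe_log_streams(
        logGroupName=group_name,
        orderBy="LastEventTime",
        descending=True,
        limit=1,
    )["logStreams"]
    stream_name = streams[0]["logStreamName"]

    # Page through the stream from the start; get_log_events signals the
    # end by returning the same nextForwardToken twice in a row.
    kwargs = dict(logGroupName=group_name, logStreamName=stream_name,
                  startFromHead=True)
    token = None
    with open(output_file, "w") as out:
        while True:
            if token is not None:
                kwargs["nextToken"] = token
            response = logs.get_log_events(**kwargs)
            for event in response["events"]:
                out.write(event["message"] + "\n")
            if response["nextForwardToken"] == token:
                break
            token = response["nextForwardToken"]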

Batch deletion from a versioned bucket

It's occasionally necessary to delete multiple files from the warehouse buckets. Those buckets are versioned, so deleting things from them can take quite a bit of effort.

The batchdelete.py tool can query the versions of a given list of files, or of everything under a given prefix, and can then delete all file versions for that list or prefix.

$ ./batchdelete.py --help
usage: batchdelete.py [-h] --bucket BUCKET [--infile INFILE] [--prefix PREFIX] [--versionsfile VERSIONSFILE] [--delete] [--workers WORKERS]

Delete files from S3

optional arguments:
  -h, --help            show this help message and exit
  --bucket BUCKET       The bucket to query/delete from.
  --infile INFILE       The file containing the list of objects/prefixes to query or delete.
  --prefix PREFIX       The prefix to list all files in and optionally delete.
  --versionsfile VERSIONSFILE
                        A file with key,versionid listing to delete, generated by the querying of this script.
  --delete              Actually try to delete after querying
  --workers WORKERS     Number of parallel workers when getting versions from an 'infile'
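For reference, deleting from a versioned bucket means enumerating every version (and delete marker) of every key and removing each key/versionid pair, which is what makes this slow enough to need parallel workers. A minimal boto3 sketch of that pattern for a prefix-based deletion (the names here are illustrative, not the script's actual internals):

import boto3

s3 = boto3.client("s3")

def delete_all_versions(bucket, prefix):
    # Collect every version and delete marker under the prefix; both kinds
    # must be removed before a key fully disappears from a versioned bucket.
    paginator = s3.get_paginator("list_object_versions")
    to_delete = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for item in page.get("Versions", []) + page.get("DeleteMarkers", []):
            to_delete.append({"Key": item["Key"], "VersionId": item["VersionId"]})

    # delete_objects accepts at most 1000 entries per call, so batch them.
    for i in range(0, len(to_delete), 1000):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": to_delete[i:i + 1000], "Quiet": True},
        )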

Inventory files downloader

The S3 inventory files are a series of gzip-compressed CSV files hosted in a specific location in the warehouse infrastructure. They are generated automatically by AWS on a regular cadence (once a day).

Each CSV file lists a batch of object keys (up to 3,000,000 per file), and the whole set of inventory files adds up to a full inventory.

To get the latest set of inventory files, use the get_inventory.py script:

$ ./get_inventory.py --help
usage: get_inventory.py [-h] [-b BUCKET] [-o OUTPUT_FOLDER]

Download the latest set of S3 inventory files.

optional arguments:
  -h, --help            show this help message and exit
  -b BUCKET, --bucket BUCKET
                        The bucket whose inventory to grab.
  -o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER
                        Where to download the inventory files

and when run with the default settings:

$ ./get_inventory.py
INFO:botocore.credentials:Found credentials in environment variables.
INFO:root:Downloading inventory file: nccid-data-warehouse-prod/daily-full-inventory/data/e082ecb7-b3b5-457a-83c1-c53abfa08b45.csv.gz
INFO:root:Saved to: e082ecb7-b3b5-457a-83c1-c53abfa08b45.csv.gz
INFO:root:Downloading inventory file: nccid-data-warehouse-prod/daily-full-inventory/data/628c1dcb-681b-43e2-b190-720f0e8de880.csv.gz
INFO:root:Saved to: 628c1dcb-681b-43e2-b190-720f0e8de880.csv.gz
...
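For reference, each S3 inventory run writes a timestamped folder containing a manifest.json whose "files" array points at the gzip-compressed CSV parts; downloading the latest set means picking the newest manifest and fetching everything it references. A minimal boto3 sketch of that logic, assuming the standard inventory layout (the prefix and function names here are illustrative, not the script's actual internals):

import json
import os
import boto3

s3 = boto3.client("s3")

def download_latest_inventory(bucket, inventory_prefix, output_folder):
    # Each inventory run lives in a timestamped folder, so the
    # lexicographically largest manifest key is the most recent run.
    paginator = s3.get_paginator("list_objects_v2")
    manifests = []
    for page in paginator.paginate(Bucket=bucket, Prefix=inventory_prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/manifest.json"):
                manifests.append(obj["Key"])
    latest = sorted(manifests)[-1]

    # The manifest's "files" entries hold the keys of the CSV parts.
    manifest = json.loads(s3.get_object(Bucket=bucket, Key=latest)["Body"].read())
    for entry in manifest["files"]:
        target = os.path.join(output_folder, os.path.basename(entry["key"]))
        s3.download_file(bucket, entry["key"], target)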