The get_logs.py script downloads and collates the log fragments generated by the various loader pipelines.
$ ./get_logs.py --help
usage: get_logs.py [-h] [-g GROUP_NAME] [-s STREAM_NAME] [-o OUTPUT_FILE]
Get a whole log stream with all the fragments from AWS.
optional arguments:
-h, --help show this help message and exit
-g GROUP_NAME, --group-name GROUP_NAME
The name of the log group to use.
-s STREAM_NAME, --stream-name STREAM_NAME
The log stream name to get. By default the latest stream is queried and downloaded.
-o OUTPUT_FILE, --output-file OUTPUT_FILE
File to save the log results to.
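As a rough illustration of what collating fragments involves, the sketch below pages through a single log stream and joins the pieces in order. It is not the code of get_logs.py. It assumes the logs live in CloudWatch Logs and that client is a boto3 client created with boto3.client("logs"); the function names latest_stream_name and collate_log_events are made up for this example.

```python
def latest_stream_name(client, group_name):
    """Return the name of the stream with the most recent event, or None.

    Assumption: `client` is a boto3 CloudWatch Logs client.
    """
    response = client.describe_log_streams(
        logGroupName=group_name,
        orderBy="LastEventTime",
        descending=True,
        limit=1,
    )
    streams = response.get("logStreams", [])
    return streams[0]["logStreamName"] if streams else None


def collate_log_events(client, group_name, stream_name):
    """Collect the messages of one log stream, oldest first."""
    messages = []
    kwargs = {
        "logGroupName": group_name,
        "logStreamName": stream_name,
        "startFromHead": True,
    }
    previous_token = None
    while True:
        response = client.get_log_events(**kwargs)
        messages.extend(event["message"] for event in response["events"])
        token = response.get("nextForwardToken")
        # The end of the stream is signalled by getting back the same
        # forward token that was just passed in.
        if token is None or token == previous_token:
            break
        previous_token = token
        kwargs["nextToken"] = token
    return messages
```

Passing the client in as an argument keeps the paging logic easy to exercise with a stub object, without touching AWS.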
It's occasionally necessary to delete multiple files from the warehouse buckets. Those buckets are versioned, so fully deleting a file means deleting all of its versions, which can take quite a bit of effort.
The batchdelete.py tool helps query the versions of a list of files or of all files under a given prefix, and can optionally delete all file versions for that list or prefix.
$ ./batchdelete.py --help
usage: batchdelete.py [-h] --bucket BUCKET [--infile INFILE] [--prefix PREFIX] [--versionsfile VERSIONSFILE] [--delete] [--workers WORKERS]
Delete files from S3
optional arguments:
-h, --help show this help message and exit
--bucket BUCKET The bucket to query/delete from.
--infile INFILE The file containing the list of objects/prefixes to query or delete.
--prefix PREFIX The prefix to list all files in and optionally delete.
--versionsfile VERSIONSFILE
A file with key,versionid listing to delete, generated by the querying of this script.
--delete Actually try to delete after querying
--workers WORKERS Number of parallel workers when getting versions from an 'infile'
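To show why versioned buckets need the extra step, here is a minimal sketch of the query-then-delete flow for a single prefix. It is an illustration only, not how batchdelete.py is implemented. It assumes client is a boto3 S3 client created with boto3.client("s3"); the function names list_versions and delete_versions are hypothetical.

```python
def list_versions(client, bucket, prefix):
    """Return (key, version_id) pairs for every version and delete marker
    under a prefix.

    Assumption: `client` is a boto3 S3 client.
    """
    pairs = []
    paginator = client.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Delete markers are versions too and must be removed as well.
        for entry in page.get("Versions", []) + page.get("DeleteMarkers", []):
            pairs.append((entry["Key"], entry["VersionId"]))
    return pairs


def delete_versions(client, bucket, pairs, batch_size=1000):
    """Submit the given versions for deletion in batches.

    delete_objects accepts at most 1000 keys per request. The return value
    counts versions submitted; per-key errors in the responses are not
    inspected in this sketch.
    """
    submitted = 0
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        objects = [{"Key": key, "VersionId": vid} for key, vid in batch]
        client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": objects, "Quiet": True},
        )
        submitted += len(batch)
    return submitted
```

Keeping the listing and the deletion as separate steps mirrors the tool's query-first behaviour: the (key, version id) pairs can be reviewed or saved before anything is removed.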
The S3 inventory files are a series of gzip-compressed CSV files that are hosted in a specific location in the warehouse infrastructure. They are generated automatically by AWS once a day.
Each CSV file lists file names (up to 3,000,000 per file), and the whole set of inventory files together makes up a full inventory.
To get the latest set of inventory files, use the get_inventory.py script:
$ ./get_inventory.py --help
usage: get_inventory.py [-h] [-b BUCKET] [-o OUTPUT_FOLDER]
Download the latest set of S3 inventory files.
optional arguments:
-h, --help show this help message and exit
-b BUCKET, --bucket BUCKET
The bucket whose inventory to grab.
-o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER
Where to download the inventory files
When run with the default settings, it prints output like the following:
$ ./get_inventory.py
INFO:botocore.credentials:Found credentials in environment variables.
INFO:root:Downloading inventory file: nccid-data-warehouse-prod/daily-full-inventory/data/e082ecb7-b3b5-457a-83c1-c53abfa08b45.csv.gz
INFO:root:Saved to: e082ecb7-b3b5-457a-83c1-c53abfa08b45.csv.gz
INFO:root:Downloading inventory file: nccid-data-warehouse-prod/daily-full-inventory/data/628c1dcb-681b-43e2-b190-720f0e8de880.csv.gz
INFO:root:Saved to: 628c1dcb-681b-43e2-b190-720f0e8de880.csv.gz
...
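Once downloaded, the inventory files can be read with the standard library alone. The sketch below streams the file names out of a set of .csv.gz files. It rests on a few assumptions that should be checked against the inventory configuration: that the files have no header row, that the first two columns are the bucket and the key, and that keys are URL-encoded with spaces written as "+". The function name iter_inventory_keys is made up for this example.

```python
import csv
import gzip
from urllib.parse import unquote_plus


def iter_inventory_keys(paths):
    """Yield the object key from every row of the given inventory files.

    Assumptions: no header row, column 0 is the bucket, column 1 is the
    URL-encoded key.
    """
    for path in paths:
        # Text mode lets csv.reader consume the decompressed stream lazily,
        # so a multi-million-row file is never held in memory at once.
        with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle):
                if len(row) >= 2:
                    yield unquote_plus(row[1])
```

Because the function is a generator, the full inventory can be filtered or counted across all files without loading it whole, for example with sum(1 for _ in iter_inventory_keys(paths)).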