Deploy Sources BODS Combiner #265

Open · tiredpixel opened this issue May 5, 2024 · 2 comments

tiredpixel commented May 5, 2024

Since the rewrite of Register Files Combiner (#213), it has been possible to run the combination process locally, rather than requiring AWS services directly. However, for this to perform well, large directories from a couple of our S3 buckets must be synced to local disk. This all works, but it takes rather a lot of space (~ 300G at present), and the bulk data export uploads take a while (~ 12G/month).

It would be convenient to deploy Sources BODS Combiner somewhere, so that files can be downloaded and uploaded far more quickly, and to minimise the chance of accidental changes to local files. The existing oo-prd0-register EC2 server would be sufficient for this, but would require an additional EBS volume to be attached.

This would take little effort and incur moderate additional cost (around 33 USD/month, depending on configuration), but it would save time and reduce risk in the monthly bulk data process.

@tiredpixel tiredpixel self-assigned this Jun 13, 2024
tiredpixel commented Jun 13, 2024

A new EBS volume has been attached to EC2 server oo-prd0-register/bods-register at /mnt/data, and configured to mount automatically on boot.
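
For anyone reproducing this, mounting on boot is typically configured via /etc/fstab; a minimal sketch, in which the filesystem type, options, and UUID are all assumptions (the real UUID comes from blkid):

# /etc/fstab — illustrative entry only
UUID=0000aaaa-bbbb-cccc-dddd-eeeeffff0000  /mnt/data  ext4  defaults,nofail  0  2

The nofail option keeps the instance bootable even if the volume happens to be detached.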

Script /usr/local/bin/sync-clones has been added to help with downloading:

#!/usr/bin/env bash
set -Eeuo pipefail

# Directories within the oo-register-v2 bucket to mirror locally.
ds=(
    bods_v2
)
for d in "${ds[@]}"; do
    # --delete removes local files no longer present in the bucket,
    # keeping the local clone an exact mirror.
    aws s3 sync --delete s3://oo-register-v2/"$d"/ /mnt/data/clones/oo-register-v2/"$d"/
done

More directories can be synced if useful (e.g. to help in an investigation, as sketched below), but only bods_v2 is needed for the Combiner.
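
Extending the sync would just mean another entry in the ds array; the extra directory name here is purely hypothetical:

ds=(
    bods_v2
    some_other_dir    # hypothetical extra directory to mirror locally
)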

Script /usr/local/bin/sync-exports-tx has been added to help with publishing:

#!/usr/bin/env bash
set -Eeuo pipefail

# Publish all exports to the register bucket...
aws s3 sync /mnt/data/exports/prd/     s3://oo-register-v2/exports/
# ...and the combined exports to the public bucket.
aws s3 sync /mnt/data/exports/prd/all/ s3://public-bods/exports/
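
Since this publishes to a public bucket, it can be worth previewing what would change first; aws s3 sync supports a --dryrun flag for exactly this:

aws s3 sync --dryrun /mnt/data/exports/prd/all/ s3://public-bods/exports/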

S3 credentials have been configured so aws s3 works.
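
That would typically mean either an IAM instance profile on the EC2 server or a credentials file; a sketch of the latter, in which the profile name and key values are placeholders:

# ~/.aws/credentials — illustrative only
[default]
aws_access_key_id     = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = ****************************************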

Sources BODS has been configured to use these locations:

DC_SOURCES_BODS_DATA_EXPORTS=/mnt/data/exports
DC_SOURCES_BODS_DATA_IMPORTS=/mnt/data/clones/oo-register-v2/bods_v2
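
Presumably these variables feed bind mounts in the service definition; a hedged docker-compose.yml sketch, in which the service layout and the /app prefix are assumptions:

# docker-compose.yml (fragment) — illustrative only
services:
  sources-bods:
    volumes:
      - ${DC_SOURCES_BODS_DATA_EXPORTS}:/app/data/exports
      - ${DC_SOURCES_BODS_DATA_IMPORTS}:/app/data/imports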

This exposes that data within the container:

docker compose run sources-bods bash
du -d1 -h data/
56G     data/exports
50G     data/imports
106G    data/

With this, the Combiner can be run for a single datasource:

combine data/imports/source=PSC/ data/exports/prd/ psc

And after all datasources have been processed, the Combiner can be run to update the combined snapshot:

combine-all data/exports/prd/
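
Putting the two steps together, a monthly run might look something like the sketch below; only psc is confirmed above, so the other datasource name and the uppercase directory convention beyond source=PSC are assumptions:

#!/usr/bin/env bash
set -Eeuo pipefail

# Hypothetical monthly wrapper: combine each datasource, then refresh the combined snapshot.
for src in psc another_source; do
    combine data/imports/source="${src^^}"/ data/exports/prd/ "$src"
done
combine-all data/exports/prd/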

Keep an eye on the disk space for the EBS volume, since things will probably break if it runs out of space:

df -h /mnt/data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    150G  107G   44G  71% /mnt/data
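
If manual checks prove easy to forget, a small guard could warn when usage crosses a threshold; a sketch, in which the 85% threshold is an arbitrary choice:

#!/usr/bin/env bash
set -Eeuo pipefail

# Warn when /mnt/data usage exceeds the threshold.
use=$(df --output=pcent /mnt/data | tail -n 1 | tr -dc '0-9')
if (( use > 85 )); then
    echo "WARNING: /mnt/data is at ${use}% capacity" >&2
fi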


tiredpixel commented Jun 13, 2024

Kinesis Firehose has been reconfigured to use the larger buffer max sizes and flush intervals again, against the usual recommendation. In fact, it's now increased to 128MB/900s, up from the previous 64MB/900s (and the 5MB/300s it's been running with for the last few weeks). This is because the Combiner is more efficient when working with large files, and the PSC streamers which recently went live produce a lot of small files, each containing just a few records. Whilst the Combiner handles this fine, processing time increases by multiples, because of how the indexes are built and deduplicated. So on balance, it's probably better to have the longer intervals and larger sizes as before, at the cost of additional latency (which isn't important to us in this case).
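
For reference, the buffering hints can be changed via the AWS CLI; a hedged sketch assuming an extended S3 destination, in which the stream name, version ID, and destination ID are placeholders:

aws firehose update-destination \
    --delivery-stream-name example-stream \
    --current-delivery-stream-version-id 1 \
    --destination-id destinationId-000000000001 \
    --extended-s3-destination-update '{"BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900}}'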

Next month's bulk data import/export should use the Combiner on the EC2 server, rather than running it locally.
