Deploy Sources BODS Combiner #265

Open · tiredpixel opened this issue May 5, 2024 · 2 comments

tiredpixel commented May 5, 2024

Since the rewrite of Register Files Combiner (#213), it has been possible to run the combination process locally, rather than requiring AWS services directly. However, for this to perform well, large directories from a couple of our S3 buckets must be synced to local disk. This all works, but it takes rather a lot of space (~ 300G at present), and the bulk data export uploads take a while (~ 12G/month).

It would be convenient to deploy Sources BODS Combiner somewhere, so that files can be downloaded and uploaded far more quickly, and to minimise the chance of accidental changes to local files. The existing oo-prd0-register EC2 server would be sufficient for this, but would require an additional EBS volume to be attached.

This would take little effort and incur moderate additional cost (around 33 USD/month, depending on configuration), but it would save time and reduce risk in the monthly bulk data process.

@tiredpixel tiredpixel self-assigned this Jun 13, 2024
tiredpixel commented Jun 13, 2024

A new EBS volume has been attached to EC2 server oo-prd0-register/bods-register at /mnt/data, and configured to mount automatically on boot.
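
For anyone reproducing this, mounting on boot is typically configured via /etc/fstab; a minimal sketch, in which the filesystem type, options, and UUID are all assumptions (the real UUID comes from blkid):

# /etc/fstab — illustrative entry only
UUID=0000aaaa-bbbb-cccc-dddd-eeeeffff0000  /mnt/data  ext4  defaults,nofail  0  2

The nofail option keeps the instance bootable even if the volume happens to be detached.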

Script /usr/local/bin/sync-clones has been added to help with downloading:

#!/usr/bin/env bash
set -Eeuo pipefail

# Directories within the oo-register-v2 bucket to mirror locally.
ds=(
    bods_v2
)
for d in "${ds[@]}"; do
    # --delete removes local files no longer present in the bucket,
    # keeping the local clone an exact mirror.
    aws s3 sync --delete s3://oo-register-v2/"$d"/ /mnt/data/clones/oo-register-v2/"$d"/
done

More directories can be synced if useful (e.g. to help in an investigation, as sketched below), but only bods_v2 is needed for the Combiner.
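
Extending the sync would just mean another entry in the ds array; the extra directory name here is purely hypothetical:

ds=(
    bods_v2
    some_other_dir    # hypothetical extra directory to mirror locally
)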

Script /usr/local/bin/sync-exports-tx has been added to help with publishing:

#!/usr/bin/env bash
set -Eeuo pipefail

# Publish all exports to the register bucket...
aws s3 sync /mnt/data/exports/prd/     s3://oo-register-v2/exports/
# ...and the combined exports to the public bucket.
aws s3 sync /mnt/data/exports/prd/all/ s3://public-bods/exports/
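
Since this publishes to a public bucket, it can be worth previewing what would change first; aws s3 sync supports a --dryrun flag for exactly this:

aws s3 sync --dryrun /mnt/data/exports/prd/all/ s3://public-bods/exports/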

S3 credentials have been configured so aws s3 works.
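
That would typically mean either an IAM instance profile on the EC2 server or a credentials file; a sketch of the latter, in which the profile name and key values are placeholders:

# ~/.aws/credentials — illustrative only
[default]
aws_access_key_id     = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = ****************************************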

Sources BODS has been configured to use these locations:

DC_SOURCES_BODS_DATA_EXPORTS=/mnt/data/exports
DC_SOURCES_BODS_DATA_IMPORTS=/mnt/data/clones/oo-register-v2/bods_v2
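
Presumably these variables feed bind mounts in the service definition; a hedged docker-compose.yml sketch, in which the service layout and the /app prefix are assumptions:

# docker-compose.yml (fragment) — illustrative only
services:
  sources-bods:
    volumes:
      - ${DC_SOURCES_BODS_DATA_EXPORTS}:/app/data/exports
      - ${DC_SOURCES_BODS_DATA_IMPORTS}:/app/data/imports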

This exposes that data within the container:

docker compose run sources-bods bash
du -d1 -h data/
56G     data/exports
50G     data/imports
106G    data/

With this, the Combiner can be run for a single datasource:

combine data/imports/source=PSC/ data/exports/prd/ psc

And after all datasources have been processed, the Combiner can be run to update the combined snapshot:

combine-all data/exports/prd/
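
Putting the two steps together, a monthly run might look something like the sketch below; only psc is confirmed above, so the other datasource name and the uppercase directory convention beyond source=PSC are assumptions:

#!/usr/bin/env bash
set -Eeuo pipefail

# Hypothetical monthly wrapper: combine each datasource, then refresh the combined snapshot.
for src in psc another_source; do
    combine data/imports/source="${src^^}"/ data/exports/prd/ "$src"
done
combine-all data/exports/prd/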

Keep an eye on the disk space for the EBS volume, since things will probably break if it runs out of space:

df -h /mnt/data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    150G  107G   44G  71% /mnt/data
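
If manual checks prove easy to forget, a small guard could warn when usage crosses a threshold; a sketch, in which the 85% threshold is an arbitrary choice:

#!/usr/bin/env bash
set -Eeuo pipefail

# Warn when /mnt/data usage exceeds the threshold.
use=$(df --output=pcent /mnt/data | tail -n 1 | tr -dc '0-9')
if (( use > 85 )); then
    echo "WARNING: /mnt/data is at ${use}% capacity" >&2
fi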


tiredpixel commented Jun 13, 2024

Kinesis Firehose has been reconfigured to use the larger buffer max sizes and flush intervals again, against the usual recommendation. In fact, it's now increased to 128MB/900s, up from the previous 64MB/900s (and the 5MB/300s it's been running with for the last few weeks). This is because the Combiner is more efficient when working with large files, and the PSC streamers which recently went live produce a lot of small files, each containing just a few records. Whilst the Combiner handles this fine, processing time increases by multiples, because of how the indexes are built and deduplicated. So on balance, it's probably better to have the longer intervals and larger sizes as before, at the cost of additional latency (which isn't important to us in this case).
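
For reference, the buffering hints can be changed via the AWS CLI; a hedged sketch assuming an extended S3 destination, in which the stream name, version ID, and destination ID are placeholders:

aws firehose update-destination \
    --delivery-stream-name example-stream \
    --current-delivery-stream-version-id 1 \
    --destination-id destinationId-000000000001 \
    --extended-s3-destination-update '{"BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900}}'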

Next month's bulk data import/export should use the Combiner on the EC2 server, rather than running it locally.
