fix(data-warehouse): Monkey patch DLT to reduce mem consumption #26040

Merged
merged 2 commits into master from tom/dlt-patch-mem on Nov 6, 2024

Conversation

@Gilbert09 (Member) commented on Nov 6, 2024

Problem

Syncing large tables causes massive memory spikes (and large tables were failing) because DLT loads all of the ingested data for a table through a single delta load job.

Changes

  • We can get around this by batching the data on the DLT side
  • We set a max number of items per file and a max file size, which tells DLT how many files/partitions to break the ingested data into
  • We then monkey patch DLT (backporting the not-yet-released change from dlt-hub/dlt#2031, "[Don't Merge] Setting to control delta job count for each delta write") so that it runs a delta load job for each file rather than a single load job for all files (see the sketch below)
    • Each file is then uploaded sequentially. This is a bit slower because of the overhead of opening and writing each file to S3 individually, but it avoids the massive memory spikes
  • This is all behind a flag and enabled only for team 2's Postgres/BigQuery sources
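
For illustration only, a minimal sketch of the shape of this change (the config keys, the values, and the wrapped factory are assumptions standing in for the dlt internals this PR actually patches, not the exact code):

```python
import os

# 1) Batch the ingested data into many smaller files/partitions by capping
#    items per file and file size. dlt picks up data_writer settings from
#    env vars; the keys and values here are illustrative assumptions.
os.environ["DATA_WRITER__FILE_MAX_ITEMS"] = "200000"             # rotate file after N rows
os.environ["DATA_WRITER__FILE_MAX_BYTES"] = str(64 * 1024 ** 2)  # rotate file after ~64 MB

# 2) Monkey patch dlt so each file gets its own delta load job (the behaviour
#    added upstream in dlt-hub/dlt#2031). The factory wrapped below is a
#    placeholder for the real dlt internal that builds load jobs.
def one_job_per_file(original_create_jobs):
    """Wrap a job factory so every file is loaded by its own job."""

    def wrapper(client, file_paths, *args, **kwargs):
        jobs = []
        for path in file_paths:
            # Delegate with a single-file batch so only one file's worth of
            # data is read into memory and written to S3 at a time.
            jobs.extend(original_create_jobs(client, [path], *args, **kwargs))
        return jobs

    return wrapper

# Applied conceptually as (placeholder class/attribute names):
# FilesystemClient.create_load_jobs = one_job_per_file(FilesystemClient.create_load_jobs)
```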

Does this work well for both Cloud and self-hosted?

Likely

How did you test this code?

Tested this locally with a large table that was failing before

@Gilbert09 Gilbert09 requested a review from a team November 6, 2024 21:00
@Gilbert09 Gilbert09 merged commit 34ef22e into master Nov 6, 2024
89 checks passed
@Gilbert09 Gilbert09 deleted the tom/dlt-patch-mem branch November 6, 2024 21:50