fix(data-warehouse): Monkey patch DLT to reduce mem consumption #26040

Merged
merged 2 commits into master from tom/dlt-patch-mem on Nov 6, 2024

Conversation

@Gilbert09 (Member) commented on Nov 6, 2024

Problem

Syncing large tables causes massive memory spikes (and large tables were failing) because DLT loads all of the ingested data for a table through a single delta load job.

Changes

  • We can get around this by batching the data on the DLT side
  • We set a max number of items per file and a max file size, which tells DLT how many files/partitions to break the ingested data into
  • We then monkey patch DLT (backporting the not-yet-released change from dlt-hub/dlt#2031, "[Don't Merge] Setting to control delta job count for each delta write") so that it runs a delta load job for each file rather than a single load job for all files (see the sketch below)
    • Each file is then uploaded sequentially. This is a bit slower because of the overhead of opening and writing each file to S3 individually, but it avoids the massive memory spikes
  • This is all behind a flag and enabled only for team 2's Postgres/BigQuery sources
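
For illustration only, a minimal sketch of the shape of this change (the config keys, the values, and the wrapped factory are assumptions standing in for the dlt internals this PR actually patches, not the exact code):

```python
import os

# 1) Batch the ingested data into many smaller files/partitions by capping
#    items per file and file size. dlt picks up data_writer settings from
#    env vars; the keys and values here are illustrative assumptions.
os.environ["DATA_WRITER__FILE_MAX_ITEMS"] = "200000"             # rotate file after N rows
os.environ["DATA_WRITER__FILE_MAX_BYTES"] = str(64 * 1024 ** 2)  # rotate file after ~64 MB

# 2) Monkey patch dlt so each file gets its own delta load job (the behaviour
#    added upstream in dlt-hub/dlt#2031). The factory wrapped below is a
#    placeholder for the real dlt internal that builds load jobs.
def one_job_per_file(original_create_jobs):
    """Wrap a job factory so every file is loaded by its own job."""

    def wrapper(client, file_paths, *args, **kwargs):
        jobs = []
        for path in file_paths:
            # Delegate with a single-file batch so only one file's worth of
            # data is read into memory and written to S3 at a time.
            jobs.extend(original_create_jobs(client, [path], *args, **kwargs))
        return jobs

    return wrapper

# Applied conceptually as (placeholder class/attribute names):
# FilesystemClient.create_load_jobs = one_job_per_file(FilesystemClient.create_load_jobs)
```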

Does this work well for both Cloud and self-hosted?

Likely

How did you test this code?

Tested this locally with a large table that was failing before

@Gilbert09 Gilbert09 requested a review from a team November 6, 2024 21:00
@Gilbert09 Gilbert09 merged commit 34ef22e into master Nov 6, 2024
89 checks passed
@Gilbert09 Gilbert09 deleted the tom/dlt-patch-mem branch November 6, 2024 21:50