
Update backfill scripts for classifications #40

Merged · 4 commits merged into main on Oct 17, 2023

Conversation

@yuenmichelle1 (Collaborator) commented Oct 17, 2023

Keeping both the old script and the new script for the record.

The old script (where we use a straight-up COPY from the source and COPY TO the target db in one go) ended up taking a long time; the query would also cancel out and the connection to the db(s) would be lost. Because of how COPY works, it's hard to debug where the connection gets lost, and because it's an all-or-nothing situation, we end up with no data copied. [This is backfill_classifications.py.]
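For reference, a rough sketch of what that one-go pattern looks like. This is not the actual backfill_classifications.py; the table name, columns, and connection handling here are assumptions, and a real run would stream rather than buffer in memory. The point is that the data only ever lives in the single run, so any dropped connection loses everything:

```python
# Sketch of the single-pass approach (assumed table/column names):
# COPY out of the source, COPY into the target, nothing durable in between.
import io
import psycopg2  # assumed driver

def backfill_in_one_go(source_dsn: str, target_dsn: str) -> None:
    buf = io.StringIO()  # in-memory only: if anything fails, all progress is lost
    src = psycopg2.connect(source_dsn)
    try:
        with src.cursor() as cur:
            cur.copy_expert(
                "COPY (SELECT * FROM classifications) TO STDOUT WITH CSV", buf
            )
    finally:
        src.close()
    buf.seek(0)
    tgt = psycopg2.connect(target_dsn)
    try:
        with tgt.cursor() as cur:
            cur.copy_expert("COPY classifications FROM STDIN WITH CSV", buf)
        tgt.commit()
    finally:
        tgt.close()
```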

To mitigate, we did the following (sketched in code after this list):

  • split up the connections so that:
    1. from the source db, we save everything to CSVs of about 10,000,000 rows each.
    2. once all the CSVs are saved, we close the connection to the source db, since we have everything we need from it.
    3. we connect to the target db and COPY from all the saved CSVs.
    4. we then close the connection to the target db.
      [This is in backfill_classifications_chunk_in_files.py, save_classifications_chunk_in_files.py, then copy_classifications_from_files.py.]
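A minimal sketch of those four steps, assuming psycopg2 and a hypothetical classifications table with an id column (the real logic lives in the scripts named above):

```python
# Sketch of the chunked two-step approach; table/column names are assumptions.
import psycopg2  # assumed driver

CHUNK_SIZE = 10_000_000  # ~10M rows per CSV, matching the PR description

def save_chunks_to_csvs(source_dsn: str, prefix: str = "classifications") -> list[str]:
    """Steps 1-2: dump the source table into CSV chunks, then disconnect."""
    paths = []
    conn = psycopg2.connect(source_dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM classifications")
            total = cur.fetchone()[0]
            for i, offset in enumerate(range(0, total, CHUNK_SIZE)):
                path = f"{prefix}_{i}.csv"
                with open(path, "w") as f:
                    cur.copy_expert(
                        f"""COPY (SELECT * FROM classifications
                                  ORDER BY id
                                  LIMIT {CHUNK_SIZE} OFFSET {offset})
                            TO STDOUT WITH CSV HEADER""",
                        f,
                    )
                paths.append(path)
    finally:
        conn.close()  # step 2: nothing else is needed from the source db
    return paths

def copy_csvs_to_target(target_dsn: str, paths: list[str]) -> None:
    """Steps 3-4: connect to the target only after every CSV exists."""
    conn = psycopg2.connect(target_dsn)
    try:
        with conn.cursor() as cur:
            for path in paths:
                with open(path) as f:
                    cur.copy_expert(
                        "COPY classifications FROM STDIN WITH CSV HEADER", f
                    )
                conn.commit()  # a failure now costs one chunk, not the whole run
    finally:
        conn.close()  # step 4
```

Committing once per file is what makes this restartable: if the target connection drops, at most one chunk has to be re-copied instead of the whole table.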

Some minor changes:

  • in backfill_classifications.py, moved the long 🍑 query onto its own line.
  • updated the query to handle the case where a classification's metadata.user_groups contains null (e.g. [1234, null, null]). This has shown up in some classifications while testing; a sketch of one way to filter it follows below.
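One way to express that null-filtering on the Postgres side. This is a sketch under the assumption that user_groups lives in a jsonb metadata column; the table and column names are hypothetical, and the actual query in the script may differ:

```python
# Hypothetical query fragment: drop JSON nulls from metadata.user_groups,
# e.g. [1234, null, null] -> [1234]. Table/column names are assumptions.
USER_GROUP_IDS_SQL = """
SELECT c.id,
       COALESCE(
           (SELECT jsonb_agg(elem)
              FROM jsonb_array_elements(c.metadata -> 'user_groups') AS elem
             WHERE elem <> 'null'::jsonb),
           '[]'::jsonb  -- jsonb_agg over zero rows yields NULL, so default to []
       ) AS user_group_ids
  FROM classifications c
"""
```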

@yuenmichelle1 requested a review from @zwolf on Oct 17, 2023, 15:37
@zwolf (Member) left a comment


Looks good, though if it were me I'd probably be less inclined to run the export and the import as a single script. Since you're creating the files anyway, you have the opportunity to create them all and confirm success, then be done with that step. Your call of course, but in light of previous mystery issues I'd probably break this into two scripts/steps: one to create the files and one to read them.

@yuenmichelle1 merged commit 2f8457a into main on Oct 17, 2023 · 3 checks passed