Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(data-warehouse): added sql source v2 #25858

Merged
merged 23 commits into from
Oct 31, 2024
Merged

Conversation

Gilbert09
Copy link
Member

@Gilbert09 Gilbert09 commented Oct 28, 2024

Problem

  • Our SQL source is slow and inefficient
    • We can't sync our own clickhouse as a source due to limitations of how big of a number is allowed in Python/C
    • Bigquery syncs aren't completing in a timely manner and so sessions are expiring mid syncs
  • DLT was on an old version too - we're missing out on a bunch of fixes

Changes

Known issues:

  • ARRAY types are a bit funky and need further investigating (they're currently turned off)
  • When using the new source, we can't drop the DB in tests due to an active session to it - unsure where this session is coming from, we dispose of the engine/connection and so Edit: Fixed in my last commit

Plan:

  • Test this out with team 2 postgres source, may sure everything works as expected
  • Test out the bigquery stuff
  • Fix the above known issues
  • Add support for snowflake
  • Remove code flag and release to all users

Does this work well for both Cloud and self-hosted?

Hopefully

How did you test this code?

Tested extensively locally all day

@Gilbert09 Gilbert09 requested a review from a team October 28, 2024 18:18
@rossgray
Copy link
Contributor

rossgray commented Oct 29, 2024

Just out of interest, what changes do we need to make to the default dlt source?

Most of the code is from https://github.com/dlt-hub/verified-sources/tree/master/sources/sql_database - but is fairly modified for our circumstances

Also, just for context; did v1 come from dlt too but they've since updated it, which is why we're now creating a v2?

@Gilbert09
Copy link
Member Author

Just out of interest, what changes do we need to make to the default dlt source?

Most of the code is from dlt-hub/verified-sources@master/sources/sql_database - but is fairly modified for our circumstances

@rossgray From the top of my head

  • Adding the ability to materialise empty tables (which DLT doesn't do by default)
  • Add JSON support to BigQuery
  • Closing a leaky SQL connection
  • Adjusting PyArrow types to remove interval and binary columns (they're not supported by either DLT or HogQL)
  • Adding in better support for planetscale DBs
  • Better handling of incremental values without a default
  • Adjusting how we look for primary keys
  • Not using the first row of a table to determine types (e.g. if the value of a nullable column is null, then the type gets set to NoneType as opposed to the actual correct type)

I think that's everything. Basically all the fixes we've added over time to the existing SQL source

@rossgray
Copy link
Contributor

Just out of interest, what changes do we need to make to the default dlt source?

Most of the code is from dlt-hub/verified-sources@master/sources/sql_database - but is fairly modified for our circumstances

@rossgray From the top of my head

  • Adding the ability to materialise empty tables (which DLT doesn't do by default)
  • Add JSON support to BigQuery
  • Closing a leaky SQL connection
  • Adjusting PyArrow types to remove interval and binary columns (they're not supported by either DLT or HogQL)
  • Adding in better support for planetscale DBs
  • Better handling of incremental values without a default
  • Adjusting how we look for primary keys
  • Not using the first row of a table to determine types (e.g. if the value of a nullable column is null, then the type gets set to NoneType as opposed to the actual correct type)

I think that's everything. Basically all the fixes we've added over time to the existing SQL source

Thanks. Wonder if we could contribute some of these changes back to dlt at some point since quite a few of these sound useful to others

Copy link
Contributor

@daibhin daibhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it could have been multiple PRs but if you're confident that it's all safely behind a FF then :shipit:

@Gilbert09 Gilbert09 merged commit 5647c55 into master Oct 31, 2024
89 checks passed
@Gilbert09 Gilbert09 deleted the tom/sql-source-v2-bq branch October 31, 2024 12:56
Copy link

sentry-io bot commented Oct 31, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1730380440.9322343 with except... posthog.temporal.data_imports.pipelines.pipelin... View Issue

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants