
feat(data-warehouse): external data job rewrite #21494

Merged: 30 commits into master from tom/data-job-rewrite, Apr 17, 2024

Conversation

@Gilbert09 (Member) commented Apr 11, 2024

Problem

  • Syncing data warehouse data is done from a source perspective, meaning:
    • All tables must complete successfully for the whole sync to be deemed "successful"
    • Smaller table syncs have to wait for any larger syncs to complete first
    • Isolated errors within a single schema/table cause the whole source sync to fail
    • A bigger strain on worker resources when syncing large sources

Changes

  • Updated the external-data-job to sync a single schema within a source, meaning multiple instances of the same Temporal job are required to sync a whole source
  • Added a migration path for existing sources (see the sketch after this list):
    • When a source syncs, if it doesn't have a schema_id in the input object, then:
      • Create Temporal schedules for all active schemas
      • Trigger these new schedules
      • Delete the existing source schedule
      • Exit the current worker
    • Otherwise, if the schema_id exists in the input object, continue as normal
  • Added a new status column to external_data_schema - this is the new source of truth for the status of the sync
    • Deprecated the status column of external_data_source
    • The status of a source is determined by combining the statuses of the source's child schemas (see the second sketch below)
      • If one schema is failing, then the source label will show as failed
  • Added a schema_id to the external_data_jobs table to record which schema the job relates to
  • Updated the create source API to trigger the correct jobs
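
A minimal sketch of the migration path described above, assuming a Temporal-style async job entry point. The input dataclass and every helper (get_active_schemas, create_schema_schedule, trigger_schedule, delete_source_schedule, sync_schema) are illustrative assumptions, not the PR's actual code:

from dataclasses import dataclass
from typing import Optional
from uuid import UUID


@dataclass
class ExternalDataJobInputs:
    # Hypothetical input object; legacy source-level schedules have no schema_id.
    team_id: int
    source_id: UUID
    schema_id: Optional[UUID] = None


async def run_external_data_job(inputs: ExternalDataJobInputs) -> None:
    # All awaited helpers below are hypothetical stand-ins.
    if inputs.schema_id is None:
        # Legacy path: fan out to one schedule per active schema,
        # retire the old source-level schedule, then exit this worker.
        for schema in await get_active_schemas(inputs.source_id):
            await create_schema_schedule(inputs.team_id, schema.id)
            await trigger_schedule(schema.id)
        await delete_source_schedule(inputs.source_id)
        return

    # New path: this job instance syncs exactly one schema.
    await sync_schema(inputs.source_id, inputs.schema_id)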
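
And a hedged sketch of the status rollup; the concrete status strings and their precedence are assumptions inferred from the description above, not the PR's actual values:

def rollup_source_status(schema_statuses: list[str]) -> str:
    # One failing schema marks the whole source as failed;
    # otherwise any in-flight schema keeps the source "Running".
    if any(status == "Failed" for status in schema_statuses):
        return "Failed"
    if any(status == "Running" for status in schema_statuses):
        return "Running"
    return "Completed"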

What's not done:

  • Updating the UI to show a status label per schema - this can be added later. This PR should be a no-op change from a user's perspective on the frontend

Does this work well for both Cloud and self-hosted?

  • Probably

How did you test this code?

  • Updated existing unit tests for the data job
  • Added new unit tests for the new schedule activity
  • A fair bit of local testing with new and existing sources

finally:
    # Stop the background heartbeat and wait for the cancelled task
    # so its CancelledError is consumed before moving on.
    heartbeat_task.cancel()
    await asyncio.wait([heartbeat_task])

for schema in schemas:

Member:

Makes sense! Haven't ever tried this in practice but seems reasonable
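
For context, a minimal sketch of the heartbeat pattern the snippet above appears to come from, assuming Temporal's Python SDK (temporalio); the activity wrapper and the do_sync placeholder are illustrative, not the PR's actual code:

import asyncio

from temporalio import activity


async def _heartbeat_forever(interval: float = 30.0) -> None:
    # Periodically report liveness so Temporal doesn't time the activity out.
    while True:
        activity.heartbeat()
        await asyncio.sleep(interval)


async def do_sync(schema_id: str) -> None:
    # Placeholder for the actual schema sync work (hypothetical).
    await asyncio.sleep(1)


@activity.defn
async def import_data_activity(schema_id: str) -> None:
    heartbeat_task = asyncio.create_task(_heartbeat_forever())
    try:
        await do_sync(schema_id)
    finally:
        # Cancel the background heartbeat and wait for the task so its
        # CancelledError is consumed before the activity returns.
        heartbeat_task.cancel()
        await asyncio.wait([heartbeat_task])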

@Gilbert09 changed the title from "feat(data-warehouse): WIP data job rewrite" to "feat(data-warehouse): external data job rewrite" on Apr 12, 2024
@Gilbert09 marked this pull request as ready for review April 12, 2024 14:39
@Gilbert09 requested a review from EDsCODE April 12, 2024 14:39

@EDsCODE (Member) left a comment:

Added deletion logic, updated mypy baseline, fixed the resyncing logic, some frontend tweaks

github-actions bot (Contributor) commented Apr 17, 2024

Size Change: 0 B

Total Size: 1000 kB

Filename                   Size
frontend/dist/toolbar.js   1000 kB

compressed-size-action

@EDsCODE (Member) commented Apr 17, 2024

The failing migration check is on "null constraint for a small table"; this should be fine to merge.

@EDsCODE merged commit b2773cb into master Apr 17, 2024
102 of 105 checks passed
@EDsCODE deleted the tom/data-job-rewrite branch April 17, 2024 21:08

sentry-io bot commented Apr 18, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1713860668.035367 with exception: dlt.pipeline.pipeline in extract
  • ‼️ OperationalError: (psycopg2.OperationalError) connection to server on socket "@aavvgneg8z0ad.cdsi0m53y9as.us-east-2... sqlalchemy.sql.schema in __new__
  • ‼️ OperationalError: connection is bad: Name or service not known posthog.warehouse.models.external_data_schema i...
  • ‼️ OperationalError: consuming input failed: server closed the connection unexpectedly posthog.warehouse.external_data_source.jobs in ...
  • ‼️ OperationalError: the connection is closed posthog.warehouse.external_data_source.jobs in ...
