
feat(persons): Add simple backfill command for distinct ID overrides. #20562

Merged · 3 commits · Feb 27, 2024

Conversation


@tkaemming tkaemming commented Feb 26, 2024

Problem

We need a convenient way to populate the person_distinct_id_overrides table (#20326) with data that existed prior to the Kafka engine and materialized view (#20349).

This is a simplified version of the initial backfill procedure idea. More context about the change in approach can be found here. This essentially acts as a limited POPULATE.

How did you test this code?

Added a (somewhat contrived) test case, as I'd already written most of it anyway.
@tkaemming tkaemming marked this pull request as ready for review February 26, 2024 22:06
Comment on lines +22 to +32
```sql
SELECT
    team_id,
    distinct_id,
    argMax(person_id, version),
    argMax(is_deleted, version),
    max(version)
FROM person_distinct_id2
WHERE
    team_id = %(team_id)s
    AND version > 0
GROUP BY team_id, distinct_id
```
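
The snippet above is only the SELECT half of the backfill; presumably the command wraps it in an INSERT … SELECT into the overrides table. A sketch of the full statement (the target column list is assumed to mirror the SELECT, not taken from the PR):

```sql
-- Sketch only: the target column names are assumed, not the PR's actual code.
INSERT INTO person_distinct_id_overrides
    (team_id, distinct_id, person_id, is_deleted, version)
SELECT
    team_id,
    distinct_id,
    argMax(person_id, version),
    argMax(is_deleted, version),
    max(version)
FROM person_distinct_id2
WHERE
    team_id = %(team_id)s
    AND version > 0
GROUP BY team_id, distinct_id
```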
tkaemming (Contributor, Author) commented:
Some arbitrary decisions:

  1. GROUP BY … is optional here — both the source and destination tables will replace on version anyway, so the cardinality is likely similar before and after aggregation here, and the end state would eventually be the same after table optimization.
  2. team_id could probably be optional as well (particularly if aggregation is dropped) and we're okay with inserting a few hundred million rows at once — this might be simpler than keeping track of a progressive rollout?
  3. This doesn't retain the Kafka engine columns from the original table (_timestamp, _partition, _offset), though I suppose it'd be safe to do so given they share the same input topic… maybe worth keeping that around?
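
On decision 1, dropping the aggregation would reduce the statement to a plain filtered copy; a sketch, assuming (as the comment above notes) that the destination table replaces rows on version:

```sql
-- Non-aggregated variant (sketch, not the PR's code): inserts raw rows and
-- relies on the destination table replacing duplicates on version at merge time.
INSERT INTO person_distinct_id_overrides
SELECT team_id, distinct_id, person_id, is_deleted, version
FROM person_distinct_id2
WHERE team_id = %(team_id)s AND version > 0
```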

A reviewer (Contributor) asked:

Could this query time out with the group by?

tkaemming replied:

It could (it could also time out without the GROUP BY, though the aggregation does make that more likely). I ran the SELECT without the INSERT on some of the larger teams and didn't run into any issues, but past performance is not indicative of future results, etc. If we do run into problems, it's easy to remove.
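
If timeouts do become a problem, ClickHouse also allows bounding the query explicitly rather than relying on the server default; an illustrative sketch (the limit is an example value, not a tuned setting):

```sql
-- Illustrative only: abort the backfill query if it runs longer than 10 minutes
-- instead of relying on the server's default max_execution_time.
INSERT INTO person_distinct_id_overrides
SELECT
    team_id,
    distinct_id,
    argMax(person_id, version),
    argMax(is_deleted, version),
    max(version)
FROM person_distinct_id2
WHERE team_id = %(team_id)s AND version > 0
GROUP BY team_id, distinct_id
SETTINGS max_execution_time = 600
```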

@tkaemming tkaemming requested a review from a team February 27, 2024 05:56

@tiina303 tiina303 left a comment


lgtm


@tkaemming tkaemming merged commit 516493f into master Feb 27, 2024
73 checks passed
@tkaemming tkaemming deleted the poe-simple-backfill branch February 27, 2024 18:24