feat(persons): Add simple backfill command for distinct ID overrides. #20562
Conversation
Also adds a fairly contrived test case, as I'd already written most of it anyway...
SELECT
    team_id,
    distinct_id,
    argMax(person_id, version),
    argMax(is_deleted, version),
    max(version)
FROM person_distinct_id2
WHERE
    team_id = %(team_id)s
    AND version > 0
GROUP BY team_id, distinct_id
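As a plain-Python illustration of what this aggregation selects (no ClickHouse required; `latest_mapping` is a hypothetical helper, not part of the change): for each `(team_id, distinct_id)` pair, `argMax(x, version)` takes `x` from the row with the highest `version`, and `max(version)` takes that version itself.

```python
# Sketch of the argMax/max semantics in the query above. The row with the
# highest version determines person_id and is_deleted; max(version) is that
# same row's version.

def latest_mapping(rows):
    """rows: (person_id, is_deleted, version) tuples for one
    (team_id, distinct_id) pair, in any order."""
    person_id, is_deleted, version = max(rows, key=lambda r: r[2])
    return {"person_id": person_id, "is_deleted": is_deleted, "version": version}

rows = [
    ("person-a", False, 1),
    ("person-b", False, 3),  # highest version: argMax picks this row's values
    ("person-a", True, 2),
]
print(latest_mapping(rows))  # → {'person_id': 'person-b', 'is_deleted': False, 'version': 3}
```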
Some arbitrary decisions:
- The GROUP BY is optional here: both the source and destination tables will replace on version anyway, so the cardinality is likely similar before and after aggregation, and the end state would eventually be the same after table optimization.
- The team_id filter could probably be optional as well (particularly if aggregation is dropped), if we're okay with inserting a few hundred million rows at once. This might be simpler than keeping track of a progressive rollout?
- This doesn't retain the Kafka engine columns from the original table (_timestamp, _partition, _offset), though I suppose it'd be safe to do so given they share the same input topic… maybe worth keeping those around?
Could this query time out with the GROUP BY?
It could (though it could also time out without it; the GROUP BY just makes it more likely). I ran the SELECT without the INSERT on some of the larger teams and didn't run into any issues, but past performance is not indicative of future results, etc. If we run into any issues, it's easy to remove.
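Since the GROUP BY is the easy thing to drop if timeouts do appear, one way to keep it removable is to make it a flag when building the query. A hypothetical sketch (`build_backfill_select` is illustrative, not the actual command's code); without aggregation, all versioned rows are inserted and the destination table's replacing merges collapse them later:

```python
# Hypothetical sketch: the backfill SELECT with aggregation as an option.
# Dropping the GROUP BY trades a cheaper query for more rows inserted, but
# the destination table replaces on version, so the end state converges.

def build_backfill_select(aggregate: bool = True) -> str:
    if aggregate:
        return (
            "SELECT team_id, distinct_id, "
            "argMax(person_id, version), argMax(is_deleted, version), max(version) "
            "FROM person_distinct_id2 "
            "WHERE team_id = %(team_id)s AND version > 0 "
            "GROUP BY team_id, distinct_id"
        )
    # No aggregation: emit every versioned row as-is.
    return (
        "SELECT team_id, distinct_id, person_id, is_deleted, version "
        "FROM person_distinct_id2 "
        "WHERE team_id = %(team_id)s AND version > 0"
    )
```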
lgtm
Problem
We need a convenient way to populate the person_distinct_id_overrides table (#20326) with data that existed prior to the Kafka engine and materialized view (#20349). This is a simplified version of the initial backfill procedure idea. More context about the change in approach can be found here. This essentially acts as a limited POPULATE.

How did you test this code?
Added a (somewhat contrived) test.
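The "limited POPULATE" described in the Problem section amounts to an INSERT ... SELECT from the old table into the overrides table, scoped to a single team. A hedged sketch of that shape; `run_backfill`, the injectable `execute` callable, and the overrides column list are illustrative assumptions, not the actual command's code:

```python
# Sketch of the backfill as INSERT ... SELECT, scoped per team. The column
# list on person_distinct_id_overrides is assumed for illustration.

BACKFILL_QUERY = """
INSERT INTO person_distinct_id_overrides
    (team_id, distinct_id, person_id, is_deleted, version)
SELECT
    team_id,
    distinct_id,
    argMax(person_id, version),
    argMax(is_deleted, version),
    max(version)
FROM person_distinct_id2
WHERE
    team_id = %(team_id)s
    AND version > 0
GROUP BY team_id, distinct_id
"""

def run_backfill(execute, team_id: int) -> None:
    """execute: any callable taking (query, params), e.g. a ClickHouse
    client's execute method. Kept injectable so the sketch stays testable."""
    execute(BACKFILL_QUERY, {"team_id": team_id})
```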