
feat(persons): Add simple backfill command for distinct ID overrides. #20562

Merged · 3 commits · Feb 27, 2024

Conversation


@tkaemming tkaemming commented Feb 26, 2024

Problem

We need a convenient way to populate the person_distinct_id_overrides table (#20326) with data that existed prior to the Kafka engine and materialized view (#20349).

This is a simplified version of the initial backfill procedure idea. More context about the change in approach can be found here. This essentially acts as a limited POPULATE.

How did you test this code?

Added a (somewhat contrived) test case, as I'd already written most of it anyway.
@tkaemming tkaemming marked this pull request as ready for review February 26, 2024 22:06
Comment on lines +22 to +32
```sql
SELECT
    team_id,
    distinct_id,
    argMax(person_id, version),
    argMax(is_deleted, version),
    max(version)
FROM person_distinct_id2
WHERE
    team_id = %(team_id)s
    AND version > 0
GROUP BY team_id, distinct_id
```
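
The snippet above is only the SELECT half of the backfill; presumably the command wraps it in an INSERT … SELECT into the overrides table. A sketch of the full statement (the target column list is assumed to mirror the SELECT, not taken from the PR):

```sql
-- Sketch only: the target column names are assumed, not the PR's actual code.
INSERT INTO person_distinct_id_overrides
    (team_id, distinct_id, person_id, is_deleted, version)
SELECT
    team_id,
    distinct_id,
    argMax(person_id, version),
    argMax(is_deleted, version),
    max(version)
FROM person_distinct_id2
WHERE
    team_id = %(team_id)s
    AND version > 0
GROUP BY team_id, distinct_id
```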
tkaemming (Contributor, Author) commented:
Some arbitrary decisions:

  1. GROUP BY … is optional here — both the source and destination tables will replace on version anyway, so the cardinality is likely similar before and after aggregation here, and the end state would eventually be the same after table optimization.
  2. team_id could probably be optional as well (particularly if aggregation is dropped) and we're okay with inserting a few hundred million rows at once — this might be simpler than keeping track of a progressive rollout?
  3. This doesn't retain the Kafka engine columns from the original table (_timestamp, _partition, _offset), though I suppose it'd be safe to do so given they share the same input topic… maybe worth keeping that around?
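
On decision 1, dropping the aggregation would reduce the statement to a plain filtered copy; a sketch, assuming (as the comment above notes) that the destination table replaces rows on version:

```sql
-- Non-aggregated variant (sketch, not the PR's code): inserts raw rows and
-- relies on the destination table replacing duplicates on version at merge time.
INSERT INTO person_distinct_id_overrides
SELECT team_id, distinct_id, person_id, is_deleted, version
FROM person_distinct_id2
WHERE team_id = %(team_id)s AND version > 0
```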

A reviewer (Contributor) asked:

Could this query time out with the group by?

tkaemming replied:

It could (it could also time out without the GROUP BY, though the aggregation does make that more likely). I ran the SELECT without the INSERT on some of the larger teams and didn't run into any issues, but past performance is not indicative of future results, etc. If we do run into problems, it's easy to remove.
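
If timeouts do become a problem, ClickHouse also allows bounding the query explicitly rather than relying on the server default; an illustrative sketch (the limit is an example value, not a tuned setting):

```sql
-- Illustrative only: abort the backfill query if it runs longer than 10 minutes
-- instead of relying on the server's default max_execution_time.
INSERT INTO person_distinct_id_overrides
SELECT
    team_id,
    distinct_id,
    argMax(person_id, version),
    argMax(is_deleted, version),
    max(version)
FROM person_distinct_id2
WHERE team_id = %(team_id)s AND version > 0
GROUP BY team_id, distinct_id
SETTINGS max_execution_time = 600
```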

@tkaemming tkaemming requested a review from a team February 27, 2024 05:56

@tiina303 tiina303 left a comment


lgtm


@tkaemming tkaemming merged commit 516493f into master Feb 27, 2024
73 checks passed
@tkaemming tkaemming deleted the poe-simple-backfill branch February 27, 2024 18:24