
feat: Implement person overrides backfill task with adaptive query range selection. #20495

Closed
tkaemming wants to merge 2 commits

Conversation

@tkaemming (Contributor) commented Feb 21, 2024

Problem

This adds a basic management command for backfilling the person_distinct_id_overrides table in ClickHouse for any events where the row's person_id column does not match the latest stored value in the person_distinct_id2 table.

This approach is based on the general design described in the (internal) RFC with a couple of minor changes.

This will likely need some iteration for optimization and robustness, but should be enough to give us a good sense of how realistic this method of backfilling will be.

(This is part of #20460, obviously.)

How did you test this code?

Added tests to cover the basic functionality.

```python
def size(self):
    return self.upper - self.lower

def split(self) -> Iterator[Range]:
```
@tkaemming (author):

Iterator[Range] here is mostly for future proofing in case it makes sense to split into variable N values instead of a fixed N = 2.
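For reference, a minimal sketch of how this Range might be defined over the cityHash64 keyspace (only size and split appear in the diff above; the rest is illustrative):

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Range:
    lower: int  # inclusive
    upper: int  # exclusive

    def size(self) -> int:
        return self.upper - self.lower

    def split(self) -> Iterator[Range]:
        # Fixed N = 2 for now; returning an Iterator leaves room for
        # splitting into a variable number of subranges later.
        midpoint = self.lower + self.size() // 2
        yield Range(self.lower, midpoint)
        yield Range(midpoint, self.upper)
```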

Comment on lines +38 to +44
```python
for distinct_id, person_id in distinct_ids_to_person_id.items():
    create_person_distinct_id(
        team_id=self.team.pk,
        distinct_id=distinct_id,
        person_id=str(person_id),
        version=1,
    )
```
@tkaemming (author):

This will need to detach the materialized view once #20349 is ready, otherwise this test will start failing or flaking because the version 1 written to overrides will override the version -1.

```python
if not Team.objects.filter(id=team_id).exists():
    raise CommandError(f"Team with id={team_id!r} does not exist")

execute_backfill(BackfillQuery(team_id), dry_run=not live_run)
```
@tkaemming (author):

We'll eventually want to store if/when a team was backfilled, but that can wait until later once we've got a sense of whether or not this is a viable approach in general.

We'll also likely eventually want to be able to provide a cursor value (a Range) to resume backfilling in case of interruption.
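As a purely hypothetical sketch of that resume flow (BackfillQuery takes no range argument in this PR; the 64-bit bounds assume the cityHash64 keyspace):

```python
# Hypothetical: resume from the upper half of the 64-bit hash keyspace
# after an interruption. The `range` keyword argument does not exist yet.
resume_from = Range(lower=2**63, upper=2**64)
execute_backfill(BackfillQuery(team_id, range=resume_from), dry_run=not live_run)
```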

Comment on lines +98 to +109
```python
def execute_backfill(query: BackfillQuery, dry_run: bool = False) -> None:
    logger.info("Executing %r...", query)
    try:
        query.execute(dry_run=dry_run)
    except ServerException as e:
        if e.code not in {ErrorCodes.TOO_SLOW, ErrorCodes.TOO_MANY_ROWS}:
            raise
        logger.warning("Caught %s when running %r! Trying smaller ranges...", e, query)
        for chunk in query.split():
            execute_backfill(chunk, dry_run=dry_run)
    else:
        logger.info("Successfully executed %r.", query)
```
@tkaemming (author):

This could be part of BackfillQuery.execute, or even some other higher-level class — there didn't seem to be an obvious best place to put it right now, so I just left it kind of dangling here.

```python
try:
    query.execute(dry_run=dry_run)
except ServerException as e:
    if e.code not in {ErrorCodes.TOO_SLOW, ErrorCodes.TOO_MANY_ROWS}:
```
@tkaemming (author):

I'm not sure this set is exhaustive — just going based on what I saw when poking around doing the equivalent of the dry run query a few weeks ago.
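If more codes turn up, naming the set as a module-level constant would keep the growth in one place (a sketch; the constant and helper are made up, and the import path assumes clickhouse-driver):

```python
from clickhouse_driver.errors import ErrorCodes, ServerException  # assumed import path

# Hypothetical constant: the two members are the codes handled in this PR.
RETRYABLE_ERROR_CODES = frozenset({ErrorCodes.TOO_SLOW, ErrorCodes.TOO_MANY_ROWS})


def is_retryable(e: ServerException) -> bool:
    return e.code in RETRYABLE_ERROR_CODES
```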

```python
    itertools.repeat(mock.DEFAULT),
)
execute_backfill(BackfillQuery(self.team.id))
assert mock_execute.call_count == 3  # initial query (error), then two queries after splitting
```
@tkaemming (author):

This is pretty clumsy but does the important thing. (I guess if I also mocked execute_backfill I could validate the correct split ranges were being passed, which would help here…)
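For example, spying on split while keeping its real behavior would at least pin down which query got split (a sketch; the patch target is assumed):

```python
# Sketch: autospec records the bound instance, so the ranges that were
# split can be recovered from the call args afterwards.
with mock.patch.object(
    BackfillQuery, "split", autospec=True, side_effect=BackfillQuery.split
) as split_spy:
    execute_backfill(BackfillQuery(self.team.id))

split_spy.assert_called_once()  # only the initial (failing) query was split
```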

Comment on lines +61 to +65
```python
def test_execute_backfill(self) -> None:
    self.__run_test_backfill(dry_run=False)

def test_execute_backfill_dry_run(self) -> None:
    self.__run_test_backfill(dry_run=True)
```
@tkaemming (author):

Doesn't seem like pytest.mark.parametrize works on methods, just plain functions? (… at least I couldn't figure it out quickly.)
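(For the record: parametrize works on methods of plain classes, just not on unittest.TestCase subclasses, which is presumably what's in play here. A sketch of the plain-class form:)

```python
import pytest


# Sketch: pytest collects plain classes too, and parametrize works on
# their methods; it is only unittest.TestCase methods that are unsupported.
class TestExecuteBackfill:
    @pytest.mark.parametrize("dry_run", [False, True])
    def test_execute_backfill(self, dry_run: bool) -> None:
        ...  # body would mirror __run_test_backfill(dry_run=dry_run)
```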

Comment on lines +86 to +87
```python
# XXX: Nothing useful to report here, unfortunately... all that is
# returned is an empty result set.
```
@tkaemming (author):

😾

Comment on lines +100 to +107
```python
try:
    query.execute(dry_run=dry_run)
except ServerException as e:
    if e.code not in {ErrorCodes.TOO_SLOW, ErrorCodes.TOO_MANY_ROWS}:
        raise
    logger.warning("Caught %s when running %r! Trying smaller ranges...", e, query)
    for chunk in query.split():
        execute_backfill(chunk, dry_run=dry_run)
```
@tkaemming (author):

There's probably a kinder and more intelligent way to do this than just recursively dividing the keyspace into smaller chunks until it doesn't error, but this seems good enough for now?
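One kinder variant, for instance, would be an explicit work queue instead of recursion, which bounds stack depth and leaves room for limits or backoff later (a sketch reusing this PR's names, which are assumed to be in scope):

```python
from collections import deque


def execute_backfill_iteratively(query: BackfillQuery, dry_run: bool = False) -> None:
    # Same divide-on-failure behavior as the recursive version, but with an
    # explicit queue, so depth is bounded and progress is easy to checkpoint.
    pending = deque([query])
    while pending:
        q = pending.popleft()
        try:
            q.execute(dry_run=dry_run)
        except ServerException as e:
            if e.code not in {ErrorCodes.TOO_SLOW, ErrorCodes.TOO_MANY_ROWS}:
                raise
            pending.extend(q.split())
```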

@tkaemming (author):

After some ad hoc testing, I think it might be more effective to try and set a target size for the right hand side table, and split into however many chunks are needed to achieve that target size by using the value of count() / target_size, with the presumption that cityHash64(distinct_id) is uniformly distributed. I'm not sure what that target size should be, though: figuring that out might require some numberwanging.

As implemented, this has the drawback that queries can be too slow either because the right hand side table is too big (the join is too complex) or because the left hand side table requires scanning too many rows. Taking care of the size of the left hand side likely means breaking the job down by partition when it is unwieldy. A rough sketch of the target-size chunking is below.
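Roughly, that chunking arithmetic might look like the following (TARGET_SIZE and the helper are made up; Range is the type from this PR):

```python
import math

TARGET_SIZE = 1_000_000  # placeholder; the right value needs experimentation


def split_for_target_size(full_range: Range, count: int, target_size: int = TARGET_SIZE) -> list[Range]:
    # Split the hash keyspace into enough equal chunks that each one joins
    # against roughly `target_size` rows, assuming cityHash64 is uniform.
    num_chunks = max(1, math.ceil(count / target_size))
    step = math.ceil(full_range.size() / num_chunks)
    return [
        Range(lower, min(lower + step, full_range.upper))
        for lower in range(full_range.lower, full_range.upper, step)
    ]
```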

Going to put this back as a draft while I look into doing that.

@tkaemming tkaemming requested a review from a team February 22, 2024 01:23
@tkaemming tkaemming marked this pull request as ready for review February 22, 2024 01:23
@tkaemming tkaemming mentioned this pull request Feb 22, 2024
@tkaemming tkaemming marked this pull request as draft February 22, 2024 22:51
@tiina303 (Contributor):

is this still a draft or would you like us to review?

@tkaemming (author):

> is this still a draft or would you like us to review?

It was still a draft — but since we'll likely change direction to simplify even further (based on this conversation), I'm just going to close this.

@tkaemming tkaemming closed this Feb 23, 2024
@tkaemming (author):

(Superseded by #20562.)

@tkaemming tkaemming deleted the distinct-id-overrides-backfill branch February 26, 2024 22:08