Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(plugin-server): Add flat person override table and writer #19220

Merged
merged 21 commits into from
Dec 18, 2023
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
fa3fc28
Add basic tests for ensuring implementations remain in sync.
tkaemming Dec 6, 2023
2ee74a3
Add flat table override definition.
tkaemming Dec 7, 2023
4f04b50
Add automatic migration.
tkaemming Dec 7, 2023
9d75860
Add flat person override writer and add to override writer test suite.
tkaemming Dec 7, 2023
55c0aaf
Improve test structure.
tkaemming Dec 7, 2023
24dfd48
Provide the override writer as a constructor argument to the override…
tkaemming Dec 7, 2023
3834e98
Consolidate tests on the `getPersonOverrides` writer method for easie…
tkaemming Dec 7, 2023
201f74d
Use set instead of sorted arrays.
tkaemming Dec 7, 2023
ce1365b
Add new test variant for flat overrides to the big test suite.
tkaemming Dec 7, 2023
922d11d
Test twiddling
tkaemming Dec 7, 2023
845cf50
Add `POE_DEFERRED_WRITES_USE_FLAT_OVERRIDES` setting.
tkaemming Dec 7, 2023
abe8f8a
Remove comment about foreign key on team_id.
tkaemming Dec 8, 2023
61fb17e
Split up non-deferred from deferred overrides, remove another comment.
tkaemming Dec 8, 2023
72e2c9c
Use `serverConfig` instead of optional `hub` when setting up
tkaemming Dec 15, 2023
1a1d5a0
Add index
tkaemming Dec 15, 2023
796983f
Write comment for pending overrides.
tkaemming Dec 15, 2023
9515c37
Tweak woring
tkaemming Dec 15, 2023
fe8c88a
Add comment to flat person model.
tkaemming Dec 15, 2023
b042f2f
Remove errant punctuation
tkaemming Dec 16, 2023
7b8176b
Add back constraints.
tkaemming Dec 18, 2023
367b735
Merge branch 'master' into poe-deferred-override-worker-flat, update …
tkaemming Dec 18, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions posthog/migrations/0376_flatpersonoverride.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,4 +20,8 @@ class Migration(migrations.Migration):
("version", models.BigIntegerField(blank=True, null=True)),
],
),
migrations.AddIndex(
model_name="flatpersonoverride",
index=models.Index(fields=["team_id", "override_person_id"], name="posthog_fla_team_id_224253_idx"),
),
]
63 changes: 63 additions & 0 deletions posthog/models/person/person.py
Original file line number Diff line number Diff line change
Expand Up @@ -180,6 +180,32 @@ class Meta:


class PendingPersonOverride(models.Model):
"""
The pending person overrides model/table contains records of merges that
have occurred, but have not yet been integrated into the person overrides
table.

This table should generally be considered as a log table or queue. When a
merge occurs, it is recorded to the log (added to the queue) as part of the
merge transaction. Later, another process comes along, reading from the
other end of the log (popping from the queue) and applying the necessary
updates to the person overrides table as part of secondary transaction.

This approach allows us to decouple the set of operations that must occur as
part of an atomic transactional unit during person merging (moving distinct
IDs, merging properties, deleting the subsumed person, etc.) from those that
are more tolerant to eventual consistency (updating person overrides in
Postgres and subsequently relaying those updates to ClickHouse in various
forms to update the person associated with an event.) This decoupling helps
us to minimize the overhead of the primary merge transaction by reducing the
degree of contention within the ingestion pipeline caused by long-running
transactions. This decoupling also allows us to serialize the execution of
all updates to the person overrides table through a single writer, which
allows us to safely update the person overrides table while handling tricky
cases like applying transitive updates without the need for expensive table
constraints to ensure their validity.
"""

id = models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name="ID")
team_id = models.BigIntegerField()
old_person_id = models.UUIDField()
Expand All @@ -188,13 +214,50 @@ class PendingPersonOverride(models.Model):


class FlatPersonOverride(models.Model):
"""
The (flat) person overrides model/table contains a consolidated record of
all merges that have occurred, but have not yet been integrated into the
ClickHouse events table through a squash operation. Once the effects of a
merge have been integrated into the events table, the associated override
record can be deleted from this table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If keeping these around is cheap it seems worthwhile to me. There's something nice about having a source of truth in a DB like PG.

But if I'm forgetting something about how squash works then ignore me, it's more of an idle thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are currently deleted after a squash happens, #19347 probably helps shine a lot more light on how that process works -- probably worth picking up this conversation over there? I could see the benefits of not deleting from this table, but would need to give it a closer look to check if that would be safe to do so.


This table is in some sense a materialized view over the pending person
overrides table (i.e. the merge log.) It differs from that base table in
that it should be maintained during updates to account for the effects of
transitive merges. For example, if person A is merged into person B, and
then person B is merged into person C, we'd expect the first record (A->B)
to be updated to reflect that person A has been merged into person C (A->C,
eliding the intermediate step.)

There are several important expectations about the nature of the data within
this table:

* A person should only appear as an "old" person at most once for a given
team (as appearing more than once would imply they were merged into
multiple people.)
* A person cannot be merged into themselves (i.e. be both the "old" and
"override" person within a given row.)
Comment on lines +235 to +239
Copy link
Contributor Author

@tkaemming tkaemming Dec 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we don't expect to be able to enforce the third item in this list (below this comment) any longer, but I do wonder if it would make sense to try to continue to enforce these (second point is a trivial check constraint, first point would be a probably-not-too-expensive unique constraint over team_id and old_person_id.)

Even if there isn't a playbook about what to do in the case that the constraint validation fails (other than just generally panic), it's probably better to know something has gone wrong (and stop overrides processing, even if that means we'll be collecting more potentially erroneous pending overrides) versus not knowing at all and just letting it happen?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, these both seem cheap and worthwhile to me. 🤷‍♂️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 'em here: 7b8176b

* A person should only appear in a table as _either_ an "old" person or
"override" person for a given team -- but never both, as this would
indicate a failure to account for a transitive merge.

The "flat" in the table name is used to distinguish this table from a prior
approach that required multiple tables to maintain the same state but
otherwise has little significance of its own.
"""

id = models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name="ID")
team_id = models.BigIntegerField()
tkaemming marked this conversation as resolved.
Show resolved Hide resolved
old_person_id = models.UUIDField()
override_person_id = models.UUIDField()
oldest_event = models.DateTimeField()
version = models.BigIntegerField(null=True, blank=True)

class Meta:
indexes = [
models.Index(fields=["team_id", "override_person_id"]),
]


def get_distinct_ids_for_subquery(person: Person | None, team: Team) -> List[str]:
"""_summary_
Expand Down
Loading