-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(plugin-server): Add flat person override table and writer #19220
Changes from 5 commits
fa3fc28
2ee74a3
4f04b50
9d75860
55c0aaf
24dfd48
3834e98
201f74d
ce1365b
922d11d
845cf50
abe8f8a
61fb17e
72e2c9c
1a1d5a0
796983f
9515c37
fe8c88a
b042f2f
7b8176b
367b735
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -180,6 +180,32 @@ class Meta: | |
|
||
|
||
class PendingPersonOverride(models.Model): | ||
""" | ||
The pending person overrides model/table contains records of merges that | ||
have occurred, but have not yet been integrated into the person overrides | ||
table. | ||
|
||
This table should generally be considered as a log table or queue. When a | ||
merge occurs, it is recorded to the log (added to the queue) as part of the | ||
merge transaction. Later, another process comes along, reading from the | ||
other end of the log (popping from the queue) and applying the necessary | ||
updates to the person overrides table as part of secondary transaction. | ||
|
||
This approach allows us to decouple the set of operations that must occur as | ||
part of an atomic transactional unit during person merging (moving distinct | ||
IDs, merging properties, deleting the subsumed person, etc.) from those that | ||
are more tolerant to eventual consistency (updating person overrides in | ||
Postgres and subsequently relaying those updates to ClickHouse in various | ||
forms to update the person associated with an event.) This decoupling helps | ||
us to minimize the overhead of the primary merge transaction by reducing the | ||
degree of contention within the ingestion pipeline caused by long-running | ||
transactions. This decoupling also allows us to serialize the execution of | ||
all updates to the person overrides table through a single writer, which | ||
allows us to safely update the person overrides table while handling tricky | ||
cases like applying transitive updates without the need for expensive table | ||
constraints to ensure their validity. | ||
""" | ||
|
||
id = models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name="ID") | ||
team_id = models.BigIntegerField() | ||
old_person_id = models.UUIDField() | ||
|
@@ -188,13 +214,50 @@ class PendingPersonOverride(models.Model): | |
|
||
|
||
class FlatPersonOverride(models.Model): | ||
""" | ||
The (flat) person overrides model/table contains a consolidated record of | ||
all merges that have occurred, but have not yet been integrated into the | ||
ClickHouse events table through a squash operation. Once the effects of a | ||
merge have been integrated into the events table, the associated override | ||
record can be deleted from this table. | ||
|
||
This table is in some sense a materialized view over the pending person | ||
overrides table (i.e. the merge log.) It differs from that base table in | ||
that it should be maintained during updates to account for the effects of | ||
transitive merges. For example, if person A is merged into person B, and | ||
then person B is merged into person C, we'd expect the first record (A->B) | ||
to be updated to reflect that person A has been merged into person C (A->C, | ||
eliding the intermediate step.) | ||
|
||
There are several important expectations about the nature of the data within | ||
this table: | ||
|
||
* A person should only appear as an "old" person at most once for a given | ||
team (as appearing more than once would imply they were merged into | ||
multiple people.) | ||
* A person cannot be merged into themselves (i.e. be both the "old" and | ||
"override" person within a given row.) | ||
Comment on lines
+235
to
+239
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know we don't expect to be able to enforce the third item in this list (below this comment) any longer, but I do wonder if it would make sense to try to continue to enforce these (second point is a trivial check constraint, first point would be a probably-not-too-expensive unique constraint over team_id and old_person_id.) Even if there isn't a playbook about what to do in the case that the constraint validation fails (other than just generally panic), it's probably better to know something has gone wrong (and stop overrides processing, even if that means we'll be collecting more potentially erroneous pending overrides) versus not knowing at all and just letting it happen? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, these both seem cheap and worthwhile to me. 🤷♂️ There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added 'em here: 7b8176b |
||
* A person should only appear in a table as _either_ an "old" person or | ||
"override" person for a given team -- but never both, as this would | ||
indicate a failure to account for a transitive merge. | ||
|
||
The "flat" in the table name is used to distinguish this table from a prior | ||
approach that required multiple tables to maintain the same state but | ||
otherwise has little significance of its own. | ||
""" | ||
|
||
id = models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name="ID") | ||
team_id = models.BigIntegerField() | ||
tkaemming marked this conversation as resolved.
Show resolved
Hide resolved
|
||
old_person_id = models.UUIDField() | ||
override_person_id = models.UUIDField() | ||
oldest_event = models.DateTimeField() | ||
version = models.BigIntegerField(null=True, blank=True) | ||
|
||
class Meta: | ||
indexes = [ | ||
models.Index(fields=["team_id", "override_person_id"]), | ||
] | ||
|
||
|
||
def get_distinct_ids_for_subquery(person: Person | None, team: Team) -> List[str]: | ||
"""_summary_ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If keeping these around is cheap it seems worthwhile to me. There's something nice about having a source of truth in a DB like PG.
But if I'm forgetting something about how squash works then ignore me, it's more of an idle thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are currently deleted after a squash happens, #19347 probably helps shine a lot more light on how that process works -- probably worth picking up this conversation over there? I could see the benefits of not deleting from this table, but would need to give it a closer look to check if that would be safe to do so.