-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Use our own soft deletes for event deletes #24582
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern here is how are we going to treat this at query time.
Let's say we have this row:
team_id | timestamp | event | uuid | _timestamp | is_deleted |
---|---|---|---|---|---|
1 | 2024-08-27 | an_event | 72d9d8d3-9bb0-4079-8621-1195560085b6 | 2024-08-27 00:00:00 | False |
And now we delete it, resulting in:
team_id | timestamp | event | uuid | _timestamp | is_deleted |
---|---|---|---|---|---|
1 | 2024-08-27 | an_event | 72d9d8d3-9bb0-4079-8621-1195560085b6 | 2024-08-27 00:00:00 | False |
1 | 2024-08-27 | an_event | 72d9d8d3-9bb0-4079-8621-1195560085b6 | 2024-08-28 00:00:00 | True |
If at query time we just do:
select * from events where not is_deleted;
We are still going to retrieve the row (unless a merge of the part has been performed by ClickHouse), because we would be retrieving the old one with the is_deleted = False
:
team_id | timestamp | event | uuid | _timestamp | is_deleted |
---|---|---|---|---|---|
1 | 2024-08-27 | an_event | 72d9d8d3-9bb0-4079-8621-1195560085b6 | 2024-08-27 00:00:00 | False |
I think we would need to deduplicate first at query time and then filter by is_deleted = False
. Is that something we are taking into account?
Wanted to put it here out loud just in case.
@@ -3,127 +3,127 @@ | |||
''' | |||
|
|||
ALTER TABLE sharded_events ON CLUSTER 'posthog' | |||
DELETE | |||
UPDATE is_deleted = 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit strange, but it is like this because for ALTER
statements we are setting replacing_all_numbers
to True
when calling assertQueryMatchesSnapshot
.
I did not want to introduce disruption, so I left it like this, but can be discussed of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be merged.
I'll start working on a new PR to swap the events table with a view. But, meanwhile, we could at least start performing the pending deletes so the mutations are run.
Once the new PR is ready we'll be able to start automatically filtering by the non-deleted events.
The new PR will include an update to this ALTER statement so we reference the renamed events table correctly. Nevermind, the sharded_events
table won't be renamed, just the distributed one.
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the |
This PR was closed due to lack of activity. Feel free to reopen if it's still relevant. |
Problem
Clickhouse deletes are a nightmare
Changes
Let's implement our own soft delete so we have control over how things actually operate. This will allow us to run this job every day with very little impact on performance.
👉 Stay up-to-date with PostHog coding conventions for a smoother review.
Does this work well for both Cloud and self-hosted?
How did you test this code?