feat(persons): Add bulk deletion endpoint #24790
Conversation
Sounds great. A few suggestions to ensure the bulk deletes are efficient and fast.
posthog/api/person.py (Outdated)

```python
"""
This endpoint allows you to bulk delete persons, either by the PostHog persons ID or by distinct IDs. You can pass through a maximum of 100 ids per call.
"""
if request.data.get("distinct_ids"):
```
Total nit: the walrus operator would prevent us having to call request.data.get("distinct_ids") twice in this method.
```diff
-if request.data.get("distinct_ids"):
+if distinct_ids := request.data.get("distinct_ids"):
```
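For illustration, a minimal standalone sketch of the difference the suggestion makes; `payload` here stands in for `request.data` and is not from the PR:

```python
payload = {"distinct_ids": ["user-1", "user-2"]}  # stand-in for request.data, not PR code

# Without the walrus operator the payload is read twice:
if payload.get("distinct_ids"):
    distinct_ids = payload.get("distinct_ids")

# With it, the lookup happens once and the value is bound for reuse in the branch:
if distinct_ids := payload.get("distinct_ids"):
    print(distinct_ids)
```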
```python
AsyncDeletion.objects.bulk_create(
    [
        AsyncDeletion(
            deletion_type=DeletionType.Person,
            team_id=self.team_id,
            key=str(person.uuid),
            created_by=cast(User, self.request.user),
        )
    ],
    ignore_conflicts=True,
)
```
This is going to be up to 100 INSERTs per request; it would be great to move the AsyncDeletion.objects.bulk_create() outside of the loop.
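A rough sketch of that suggestion, reusing the names from the snippet above; the final shape in the PR may differ:

```python
# Sketch only: build every AsyncDeletion row first, then issue a single bulk
# INSERT instead of calling bulk_create() once per person inside the loop.
async_deletions = [
    AsyncDeletion(
        deletion_type=DeletionType.Person,
        team_id=self.team_id,
        key=str(person.uuid),
        created_by=cast(User, self.request.user),
    )
    for person in persons
]
AsyncDeletion.objects.bulk_create(async_deletions, ignore_conflicts=True)
```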
Same problem as #24790 (comment)
```python
for person in persons:
    delete_person(person=person)
    self.perform_destroy(person)
```
Why not bulk persons.objects.delete()?
I was worried about deletions failing halfway through, because we can't do transactions in ClickHouse. So if we first bulk-deleted all the persons in PostHog but failed for half the people in ClickHouse, we'd end up in a weird state. By doing it sequentially, at least the failure will be contained to one person, rather than potentially up to 100.
Hm, I don't think we can tell if a ClickHouse-side delete fails anyway, because all delete_person() does is queue a deletion row into Kafka, which is extremely unlikely to fail. So if we first bulk-delete in Postgres and then emit deletion rows to CH, that should be the highest level of integrity possible in this situation.
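A hedged sketch of the ordering being proposed here, reusing names from the snippets above; whether delete_person() can still work with an already-deleted in-memory instance is an assumption:

```python
# Sketch of the proposed ordering (an assumption, not the merged code):
# one bulk DELETE in Postgres, then emit the ClickHouse deletion rows.
persons_list = list(persons)      # materialize first so the objects survive the delete
persons.delete()                  # single bulk DELETE statement in Postgres
for person in persons_list:
    delete_person(person=person)  # queues a deletion row into Kafka for ClickHouse
```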
My point is we may not even emit those rows if we fail halfway through.
It's definitely a tradeoff – in this explicit bulk delete case it doesn't feel great to put this O(n) load on Postgres, but I see what you mean. For maximum integrity with this route, we should swap the delete_person(person=person) and self.perform_destroy(person) lines though.
Hm, this doesn't work, as the person gets deleted before ClickHouse gets a chance to delete it. I also think it's more likely for ClickHouse to fail, so the current order does make sense.
Problem
Most users only have distinct_ids saved locally, not our person ID. So in order to delete a person, they first make a call to the persons endpoint to get the UUID, then a call to delete that person. This is inefficient, especially when you have to delete a lot of users.
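To make the round trip concrete, a hedged illustration of the two-call flow described above; the exact paths and parameter names are assumptions:

```python
import requests

headers = {"Authorization": "Bearer <personal_api_key>"}
base = "https://app.posthog.com/api/projects/<project_id>"

# 1. Resolve the locally known distinct_id to a PostHog person (assumed route and params)
results = requests.get(
    f"{base}/persons/", params={"distinct_id": "user-123"}, headers=headers
).json()["results"]

# 2. Delete that person with a second call
requests.delete(f"{base}/persons/{results[0]['id']}/", headers=headers)
```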
PR has a lot of changes because of #24815, but the actual changes are in this commit
Changes
Create a bulk delete endpoint that can handle both uuids and distinct ids.
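A hedged sketch of how the new endpoint could be called, based on the docstring in the diff above (person IDs or distinct IDs, max 100 per call); the route name and the payload shape beyond distinct_ids are assumptions:

```python
import requests

headers = {"Authorization": "Bearer <personal_api_key>"}
base = "https://app.posthog.com/api/projects/<project_id>"

# One request instead of one lookup plus one delete per person (route name assumed)
requests.post(
    f"{base}/persons/bulk_delete/",
    json={"distinct_ids": ["user-123", "user-456"]},  # up to 100 ids per call
    headers=headers,
)
```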
Does this work well for both Cloud and self-hosted?
How did you test this code?