
feat(persons): Add bulk deletion endpoint #24790

Merged
merged 68 commits into from
Sep 10, 2024

Conversation

timgl
Collaborator

@timgl timgl commented Sep 4, 2024

Problem

Most users only have distinct_ids saved locally, not our person ID. So in order to delete a person, they first make a call to the persons endpoint to get the UUID, then a second call to delete that person. This is inefficient, especially when you have to delete a lot of users.

The PR has a lot of changes because of #24815, but the actual changes are in this commit.

Changes

Create a bulk delete endpoint that can handle both uuids and distinct ids.


Does this work well for both Cloud and self-hosted?

How did you test this code?
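The input handling the description implies (accept either UUIDs or distinct IDs, cap at 100 per call) can be sketched as a plain helper. This is a hypothetical illustration, not PostHog's actual implementation; the function and field names `validate_bulk_delete_payload` and `ids` are assumptions:

```python
def validate_bulk_delete_payload(data: dict) -> tuple[str, list]:
    """Pick the ID field from a bulk-delete request body and enforce the cap.

    Hypothetical sketch of the validation the endpoint description implies;
    the names here do not come from the PostHog codebase.
    """
    distinct_ids = data.get("distinct_ids")
    uuids = data.get("ids")
    if distinct_ids and uuids:
        raise ValueError("Pass either distinct_ids or ids, not both")
    key, values = ("distinct_ids", distinct_ids) if distinct_ids else ("ids", uuids)
    if not values:
        raise ValueError("Pass distinct_ids or ids")
    if len(values) > 100:
        raise ValueError("You can pass a maximum of 100 ids per call")
    return key, values
```

A real endpoint would then look up the matching persons by whichever key was supplied before deleting them.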

@timgl timgl requested a review from Twixes September 4, 2024 17:01
Member

@Twixes Twixes left a comment


Sounds great. A few suggestions to ensure the bulk deletes are efficient and fast.

posthog/api/person.py (outdated review threads, resolved)
```python
"""
This endpoint allows you to bulk delete persons, either by the PostHog persons ID or by distinct IDs. You can pass through a maximum of 100 ids per call.
"""
if request.data.get("distinct_ids"):
```
Member


Total nit: the walrus operator would prevent us from having to call request.data.get("distinct_ids") twice in this method.

Suggested change:

```diff
-if request.data.get("distinct_ids"):
+if distinct_ids := request.data.get("distinct_ids"):
```
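A minimal standalone illustration of the suggestion, using a plain dict in place of DRF's request.data:

```python
# Stand-in for DRF's request.data (hypothetical payload).
request_data = {"distinct_ids": ["u1", "u2"]}

# Without :=, the same dict lookup runs twice:
if request_data.get("distinct_ids"):
    distinct_ids = request_data.get("distinct_ids")

# With :=, one expression both tests the value and binds it:
if distinct_ids := request_data.get("distinct_ids"):
    pass  # distinct_ids is already bound here

assert distinct_ids == ["u1", "u2"]
```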

Comment on lines +448 to +458
```python
AsyncDeletion.objects.bulk_create(
    [
        AsyncDeletion(
            deletion_type=DeletionType.Person,
            team_id=self.team_id,
            key=str(person.uuid),
            created_by=cast(User, self.request.user),
        )
    ],
    ignore_conflicts=True,
)
```
Member


This is going to be up to 100 INSERTs per request; it would be great to call AsyncDeletion.objects.bulk_create() outside of the loop.
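The shape of the suggestion is: build the row list inside the loop, then issue one batched insert afterwards. A schematic with a stand-in recorder instead of the Django ORM (FakeManager is hypothetical, not PostHog code):

```python
class FakeManager:
    """Stand-in for AsyncDeletion.objects that counts INSERT batches."""

    def __init__(self):
        self.batches = []

    def bulk_create(self, objs, ignore_conflicts=False):
        # The real Django method issues one (batched) INSERT per call.
        self.batches.append(list(objs))


person_uuids = [f"uuid-{i}" for i in range(100)]

# Shape flagged in review: bulk_create inside the loop -> 100 INSERTs.
per_row = FakeManager()
for uuid in person_uuids:
    per_row.bulk_create([{"key": uuid}], ignore_conflicts=True)

# Suggested shape: accumulate rows, then one bulk_create -> 1 INSERT.
batched = FakeManager()
batched.bulk_create([{"key": uuid} for uuid in person_uuids], ignore_conflicts=True)

print(len(per_row.batches), len(batched.batches))  # 100 1
```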

Collaborator Author


Same problem as #24790 (comment)


```python
for person in persons:
    delete_person(person=person)
    self.perform_destroy(person)
```
Member


Why not bulk persons.objects.delete()?

Collaborator Author


I was worried about deletions failing halfway through, because we can't do transactions in ClickHouse. So if we first bulk-deleted all the persons in Postgres, but failed for half the people in ClickHouse, we'd end up in a weird state. By doing it sequentially, at least the failure will be contained to one person, rather than potentially up to 100.

Member


Hm, I don't think we can tell if a ClickHouse-side delete fails anyway, because all delete_person() does is queue a deletion row into Kafka, which is extremely unlikely to fail. So if we first bulk-delete in Postgres, and then emit deletion rows to CH, that should be the highest level of integrity possible in this situation.

Collaborator Author


My point is we may not even emit those rows if we fail halfway through.

Member


It's definitely a tradeoff – in this explicit bulk-delete case it doesn't feel great to put this O(n) load on Postgres, but I see what you mean. For maximum integrity going this route, we should swap the delete_person(person=person) and self.perform_destroy(person) lines though.

Collaborator Author


Hm, this doesn't work, as the person gets deleted before ClickHouse gets a chance to delete it. I also think it's more likely for ClickHouse to fail, so the current order does make sense.
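The failure-containment argument in this thread can be simulated: with per-person sequencing (ClickHouse-side step first, then Postgres), a failure mid-batch leaves every earlier person consistently deleted in both stores and every later one untouched in both. A toy model, not PostHog code; all names here are hypothetical:

```python
def sequential_delete(persons, emit_to_clickhouse, delete_from_postgres):
    """Delete one person at a time; a failure aborts the batch, but every
    person processed so far is consistently deleted in both stores."""
    for person in persons:
        emit_to_clickhouse(person)    # may raise; runs first
        delete_from_postgres(person)  # only reached if the emit succeeded


# Both stores start with persons 0..4.
pg, ch = set(range(5)), set(range(5))

def emit(p):
    if p == 3:  # simulate the ClickHouse-side step failing mid-batch
        raise RuntimeError("ClickHouse emit failed")
    ch.discard(p)

try:
    sequential_delete(sorted(pg), emit, pg.discard)
except RuntimeError:
    pass

# Persons 0-2 are gone from both stores; 3 and 4 are untouched in both.
assert pg == {3, 4} and ch == {3, 4}
```

A bulk Postgres delete followed by a failing mid-batch emit would instead leave rows gone from one store but not the other, which is the "weird state" described above.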

posthog/api/person.py (review threads, resolved)
@timgl timgl force-pushed the bulk-delete-endpoint branch from eb5df07 to 2cf9516 Compare September 5, 2024 15:53
@Twixes Twixes changed the title feat: Add bulk deletion endpoint feat(persons): Add bulk deletion endpoint Sep 6, 2024
pauldambra and others added 26 commits September 10, 2024 11:35
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michael Matloka <[email protected]>
Co-authored-by: Michael Matloka <[email protected]>
Co-authored-by: Michael Matloka <[email protected]>
@timgl timgl enabled auto-merge (squash) September 10, 2024 10:49
@timgl timgl merged commit c9bf238 into master Sep 10, 2024
84 checks passed
@timgl timgl deleted the bulk-delete-endpoint branch September 10, 2024 12:57