
feat(persons): Add bulk deletion endpoint #24790

Merged Sep 10, 2024 · 68 commits

Commits
bf66822
feat: Add bulk deletion endpoint
timgl Sep 4, 2024
d0aa06f
fix
timgl Sep 5, 2024
5269bb7
Update posthog/api/person.py
timgl Sep 5, 2024
e90f3c7
Update posthog/api/person.py
timgl Sep 5, 2024
2cf9516
Update posthog/api/person.py
timgl Sep 5, 2024
5f65b12
nit
timgl Sep 5, 2024
86b966e
docs nit
timgl Sep 5, 2024
5e25cc7
Remove from Postgres before ClickHouse emit
Twixes Sep 6, 2024
8d5053d
Fix `person` instance access
Twixes Sep 9, 2024
4196136
chore: allow exporting mobile recordings while impersonating a user (…
pauldambra Sep 5, 2024
5c72dda
feat(onboarding-templates): create custom variable selector panel (#2…
raquelmsmith Sep 5, 2024
380f683
feat(insights): launch funnels as a Clickhouse UDF behind a feature f…
aspicer Sep 5, 2024
f5791bb
fix(hedgehog): Fix hedgehog in profile pictures (#24813)
Twixes Sep 5, 2024
68d7252
feat: add sdk target area to support form (#24827)
MarconLP Sep 5, 2024
49629bf
fix: Removed old heartbeat metric (#24829)
benjackwhite Sep 6, 2024
0da6e88
feat: Switch batch exports logging to async (#24819)
tomasfarias Sep 6, 2024
dacc610
fix: rrweb player now has a destroy method (#24826)
pauldambra Sep 6, 2024
4f2c08d
chore: refactor RedisLimiter to update async out of the hot path (#24…
frankh Sep 6, 2024
be8187a
feat(cdp): sendgrid template updates (#24814)
mariusandra Sep 6, 2024
28f3367
fix(hog): bools and numbers are not "empty" (#24835)
mariusandra Sep 6, 2024
91cd660
feat(propdefs): add filtering to allow for gradual rollout (#24820)
oliverb123 Sep 6, 2024
97e8ec2
feat: show absolute time in seekbar preview (#24837)
daibhin Sep 6, 2024
682e8f9
fix(propdefs): push all updates into batch before releasing (#24841)
oliverb123 Sep 6, 2024
b399e5f
fix: patch mobile recordings that are missing their meta event (#24840)
pauldambra Sep 6, 2024
e08c403
fix(billing): Avoid double period in over limit warning (#24831)
Twixes Sep 6, 2024
2dcb74f
ci: Niceify Rust CI naming (#24834)
Twixes Sep 6, 2024
c7b854f
chore(frontend): Kill @ant-design/icons (#24817)
Twixes Sep 6, 2024
af48da5
feat(kafka-producer): ping kafka brokers (#24836)
Elvis339 Sep 6, 2024
4507c31
feat(data-warehouse): modeling ui (#24587)
EDsCODE Sep 6, 2024
e391da0
feat: shared dashboards allow viewer to preview filter change (#24845)
anirudhpillai Sep 9, 2024
39ea30f
refactor: Replace style props with css classes (#24732)
anirudh24seven Sep 9, 2024
9bfb4a7
fix: activity log stories (#24851)
pauldambra Sep 9, 2024
e8596cb
feat: get property definitions out of plugin-server (#24843)
oliverb123 Sep 9, 2024
aca1bae
fix: big int in events query (#24847)
aspicer Sep 9, 2024
431d733
fix(docs): Document insights API `refresh` param correctly (#24830)
Twixes Sep 9, 2024
338686a
fix(web-analytics): Fix infinite loop when setting the date range to …
robbie-c Sep 9, 2024
6f17af9
feat: group status & assignee actions (#24821)
daibhin Sep 9, 2024
e317d39
chore(deps): Update posthog-js to 1.161.0 (#24860)
posthog-bot Sep 9, 2024
be3d8f0
feat(cdp): add avo template (#24705)
MarconLP Sep 9, 2024
6221da0
feat(experiments): set up experiment result query runner (#24842)
jurajmajerik Sep 9, 2024
bb122d1
feat(web-analytics): Improve the outbound link clicks tile (#24748)
robbie-c Sep 9, 2024
3fb562c
fix: make the version checker banner team aware (#24861)
pauldambra Sep 9, 2024
5cd21ea
feat: web vitals metrics allowlist (#24850)
pauldambra Sep 9, 2024
7777500
fix(dashboards): Make sure loading dashboard items does not `POST /qu…
Twixes Sep 9, 2024
ec97dd8
fix(dashboards): Ensure text card draggability (#24859)
Twixes Sep 9, 2024
a42d0fc
chore(feature flags): tooltip for the `Match evaluation` column (#24862)
jurajmajerik Sep 9, 2024
b1c688c
feat: Hedgehog mode skins (#24838)
benjackwhite Sep 9, 2024
1fb2ec9
chore(deps): Update posthog-js to 1.161.1 (#24866)
posthog-bot Sep 9, 2024
46c9a32
chore(data-warehouse): enable logging (#24867)
EDsCODE Sep 9, 2024
5630444
chore(data-warehouse): add psutil (#24875)
EDsCODE Sep 10, 2024
ca766ee
feat(cdp): Use cyclotron part 2 (#24746)
benjackwhite Sep 10, 2024
5bb8cc6
chore(data-warehouse): update deploy conditions for temporal workers …
EDsCODE Sep 10, 2024
7a67664
chore: Update readme for web analytics launch (#24812)
joethreepwood Sep 10, 2024
47e860b
feat: add replay overflow limiter to rust capture (#24803)
frankh Sep 10, 2024
5c875d5
fxi
timgl Sep 10, 2024
bb05558
feat: Add bulk deletion endpoint
timgl Sep 4, 2024
dceb26f
fix
timgl Sep 5, 2024
b7e8a75
Update posthog/api/person.py
timgl Sep 5, 2024
735c969
Update posthog/api/person.py
timgl Sep 5, 2024
0ff2391
Update posthog/api/person.py
timgl Sep 5, 2024
0bb1f89
nit
timgl Sep 5, 2024
93857ed
docs nit
timgl Sep 5, 2024
8cd543b
Remove from Postgres before ClickHouse emit
Twixes Sep 6, 2024
3c5f554
Fix `person` instance access
Twixes Sep 9, 2024
9a29853
fxi
timgl Sep 10, 2024
a5d1230
Merge branch 'master' into bulk-delete-endpoint
timgl Sep 10, 2024
ac40e2a
Merge branch 'bulk-delete-endpoint' of https://github.com/posthog/pos…
timgl Sep 10, 2024
37af305
fix
timgl Sep 10, 2024
90 changes: 79 additions & 11 deletions posthog/api/person.py
@@ -353,12 +353,15 @@ def list(self, request: request.Request, *args: Any, **kwargs: Any) -> response.
OpenApiParameter(
"delete_events",
OpenApiTypes.BOOL,
description="If true, a task to delete all events associated with this person will be created and queued. The task does not run immediately and instead is batched together and at 5AM UTC every Sunday (controlled by environment variable CLEAR_CLICKHOUSE_REMOVED_DATA_SCHEDULE_CRON)",
description="If true, a task to delete all events associated with this person will be created and queued. The task does not run immediately and instead is batched together and at 5AM UTC every Sunday",
default=False,
),
],
)
def destroy(self, request: request.Request, pk=None, **kwargs):
"""
Use this endpoint to delete individual persons. For bulk deletion, use the bulk_delete endpoint instead.
"""
try:
person = self.get_object()
person_id = person.id
@@ -391,6 +394,70 @@ def destroy(self, request: request.Request, pk=None, **kwargs):
except Person.DoesNotExist:
raise NotFound(detail="Person not found.")

@extend_schema(
parameters=[
OpenApiParameter(
"delete_events",
OpenApiTypes.BOOL,
description="If true, a task to delete all events associated with this person will be created and queued. The task does not run immediately and instead is batched together and at 5AM UTC every Sunday",
default=False,
),
OpenApiParameter(
"distinct_ids",
OpenApiTypes.OBJECT,
description="A list of distinct IDs, up to 100 of them. We'll delete all persons associated with those distinct IDs.",
),
OpenApiParameter(
"ids",
OpenApiTypes.OBJECT,
description="A list of PostHog person IDs, up to 100 of them. We'll delete all the persons listed.",
),
],
)
@action(methods=["POST"], detail=False, required_scopes=["person:write"])
def bulk_delete(self, request: request.Request, pk=None, **kwargs):
"""
This endpoint allows you to bulk delete persons, either by the PostHog person IDs or by distinct IDs. You can pass in a maximum of 100 IDs per call.
"""
if distinct_ids := request.data.get("distinct_ids"):
if len(distinct_ids) > 100:
raise ValidationError("You can only pass 100 distinct_ids in one call")
persons = self.get_queryset().filter(persondistinctid__distinct_id__in=distinct_ids)
elif ids := request.data.get("ids"):
if len(ids) > 100:
raise ValidationError("You can only pass 100 ids in one call")
persons = self.get_queryset().filter(uuid__in=ids)
else:
raise ValidationError("You need to specify either distinct_ids or ids")

for person in persons:
delete_person(person=person)
self.perform_destroy(person)
Member: Why not bulk persons.objects.delete()?

Collaborator Author: I was worried about deletions failing halfway through, because we can't do transactions in ClickHouse. If we first bulk-deleted all the persons in Postgres but then failed halfway through the ClickHouse deletions, we'd end up in a weird state. By doing it sequentially, at least a failure is contained to one person, rather than potentially up to 100.

Member: Hm, I don't think we can tell if a ClickHouse-side delete fails anyway, because all delete_person() does is queue a deletion row into Kafka, which is extremely unlikely to fail. So if we first bulk-delete in Postgres and then emit deletion rows to ClickHouse, that should be the highest level of integrity possible in this situation.

Collaborator Author: My point is we may not even emit those rows if we fail halfway through.

Member: It's definitely a tradeoff – in this explicit bulk-delete case it doesn't feel great to put this O(n) load on Postgres, but I see what you mean. For maximum integrity with this route, we should swap the delete_person(person=person) and self.perform_destroy(person) lines though.

Collaborator Author: Hm, this doesn't work, as the person gets deleted before ClickHouse gets a chance to delete it. I also think ClickHouse is more likely to fail, so the current order does make sense.
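The failure-containment argument in this thread can be sketched as plain Python. This is a minimal illustration, not the endpoint's actual code: the `delete_in_ch` and `delete_in_pg` callables are hypothetical stand-ins for the ClickHouse deletion emit and the Postgres delete.

```python
# Sketch of the sequential deletion pattern discussed above. By handling one
# person at a time, a failure is contained to that single person instead of
# leaving up to 100 persons half-deleted across the two stores.
def delete_persons_sequentially(persons, delete_in_ch, delete_in_pg):
    failed = []
    for person in persons:
        try:
            delete_in_ch(person)  # emit the ClickHouse-side deletion first
            delete_in_pg(person)  # then remove the row from Postgres
        except Exception:
            failed.append(person)  # contained: remaining persons still proceed
    return failed
```

A per-person try/except is the design choice the author defends: it trades the reviewer's single bulk statement for bounded blast radius when one deletion fails.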

log_activity(
organization_id=self.organization.id,
team_id=self.team_id,
user=cast(User, request.user),
was_impersonated=is_impersonated_session(request),
item_id=person.id,
scope="Person",
activity="deleted",
detail=Detail(name=str(person.uuid)),
)
# Once the person is deleted, queue deletion of associated data, if that was requested
if request.data.get("delete_events"):
AsyncDeletion.objects.bulk_create(
[
AsyncDeletion(
deletion_type=DeletionType.Person,
team_id=self.team_id,
key=str(person.uuid),
created_by=cast(User, self.request.user),
)
],
ignore_conflicts=True,
)
Comment on lines +448 to +458
Member: This is going to be up to 100 INSERTs per request; it would be great to AsyncDeletion.objects.bulk_create() outside of the loop.

Collaborator Author: Same problem as #24790 (comment)

return response.Response(status=202)
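Since the endpoint rejects more than 100 IDs per call, a caller deleting a larger set has to batch. A small client-side helper could look like the following; the helper name is hypothetical, and actually sending the requests (host, auth headers) is deliberately left out.

```python
# Hypothetical client-side batching for POST /api/person/bulk_delete/, which
# accepts at most `limit` IDs per call. Each returned dict is one request body.
def build_bulk_delete_payloads(ids, delete_events=False, limit=100):
    return [
        {"ids": ids[i:i + limit], "delete_events": delete_events}
        for i in range(0, len(ids), limit)
    ]
```

Each payload would then be POSTed separately; a 202 response means the deletions were accepted, with event cleanup deferred to the scheduled async deletion job when delete_events is set.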

@action(methods=["GET"], detail=False, required_scopes=["person:read"])
def values(self, request: request.Request, **kwargs) -> response.Response:
key = request.GET.get("key")
@@ -636,16 +703,17 @@ def _set_properties(self, properties, user):
},
)

log_activity(
organization_id=self.organization.id,
team_id=self.team.id,
user=user,
was_impersonated=is_impersonated_session(self.request),
item_id=instance.pk,
scope="Person",
activity="updated",
detail=Detail(changes=[Change(type="Person", action="changed", field="properties")]),
)
if self.organization.id: # should always be true, but mypy...
log_activity(
organization_id=self.organization.id,
team_id=self.team.id,
user=user,
was_impersonated=is_impersonated_session(self.request),
item_id=instance.pk,
scope="Person",
activity="updated",
detail=Detail(changes=[Change(type="Person", action="changed", field="properties")]),
)

# PRAGMA: Methods for getting Persons via clickhouse queries
def _respond_with_cached_results(
81 changes: 81 additions & 0 deletions posthog/api/test/test_person.py
@@ -378,6 +378,87 @@ def test_delete_person_and_events(self):
self.assertEqual(async_deletion.key, str(person.uuid))
self.assertIsNone(async_deletion.delete_verified_at)

@freeze_time("2021-08-25T22:09:14.252Z")
def test_bulk_delete_ids(self):
person = _create_person(
team=self.team,
distinct_ids=["person_1", "anonymous_id"],
properties={"$os": "Chrome"},
immediate=True,
)
person2 = _create_person(
team=self.team,
distinct_ids=["person_2", "anonymous_id_2"],
properties={"$os": "Chrome"},
immediate=True,
)
_create_event(event="test", team=self.team, distinct_id="person_1")
_create_event(event="test", team=self.team, distinct_id="anonymous_id")
_create_event(event="test", team=self.team, distinct_id="someone_else")

response = self.client.post(
f"/api/person/bulk_delete/", {"ids": [person.uuid, person2.uuid], "delete_events": True}
)

self.assertEqual(response.status_code, status.HTTP_202_ACCEPTED, response.content)
self.assertEqual(response.content, b"") # Empty response
self.assertEqual(Person.objects.filter(team=self.team).count(), 0)

response = self.client.delete(f"/api/person/{person.uuid}/")
self.assertEqual(response.status_code, status.HTTP_404_NOT_FOUND)

ch_persons = sync_execute(
"SELECT version, is_deleted, properties FROM person FINAL WHERE team_id = %(team_id)s and id = %(uuid)s",
{"team_id": self.team.pk, "uuid": person.uuid},
)
self.assertEqual([(100, 1, "{}")], ch_persons)

# async deletion scheduled and executed
async_deletion = cast(AsyncDeletion, AsyncDeletion.objects.filter(team_id=self.team.id).first())
self.assertEqual(async_deletion.deletion_type, DeletionType.Person)
self.assertEqual(async_deletion.key, str(person.uuid))
self.assertIsNone(async_deletion.delete_verified_at)

@freeze_time("2021-08-25T22:09:14.252Z")
def test_bulk_delete_distinct_id(self):
person = _create_person(
team=self.team,
distinct_ids=["person_1", "anonymous_id"],
properties={"$os": "Chrome"},
immediate=True,
)
_create_person(
team=self.team,
distinct_ids=["person_2", "anonymous_id_2"],
properties={"$os": "Chrome"},
immediate=True,
)
_create_event(event="test", team=self.team, distinct_id="person_1")
_create_event(event="test", team=self.team, distinct_id="anonymous_id")
_create_event(event="test", team=self.team, distinct_id="someone_else")

response = self.client.post(f"/api/person/bulk_delete/", {"distinct_ids": ["anonymous_id", "person_2"]})

self.assertEqual(response.status_code, status.HTTP_202_ACCEPTED, response.content)
self.assertEqual(response.content, b"") # Empty response
self.assertEqual(Person.objects.filter(team=self.team).count(), 0)

response = self.client.delete(f"/api/person/{person.uuid}/")
self.assertEqual(response.status_code, status.HTTP_404_NOT_FOUND)

ch_persons = sync_execute(
"SELECT version, is_deleted, properties FROM person FINAL WHERE team_id = %(team_id)s and id = %(uuid)s",
{"team_id": self.team.pk, "uuid": person.uuid},
)
self.assertEqual([(100, 1, "{}")], ch_persons)
# No async deletion is scheduled
self.assertEqual(AsyncDeletion.objects.filter(team_id=self.team.id).count(), 0)
ch_events = sync_execute(
"SELECT count() FROM events WHERE team_id = %(team_id)s",
{"team_id": self.team.pk},
)[0][0]
self.assertEqual(ch_events, 3)

@freeze_time("2021-08-25T22:09:14.252Z")
def test_split_people_keep_props(self) -> None:
# created first
4 changes: 2 additions & 2 deletions posthog/models/person/util.py
@@ -232,9 +232,9 @@ def get_persons_by_uuids(team: Team, uuids: list[str]) -> QuerySet:
def delete_person(person: Person, sync: bool = False) -> None:
# This is racy https://github.com/PostHog/posthog/issues/11590
distinct_ids_to_version = _get_distinct_ids_with_version(person)
_delete_person(person.team.id, person.uuid, int(person.version or 0), person.created_at, sync)
_delete_person(person.team_id, person.uuid, int(person.version or 0), person.created_at, sync)
for distinct_id, version in distinct_ids_to_version.items():
_delete_ch_distinct_id(person.team.id, person.uuid, distinct_id, version, sync)
_delete_ch_distinct_id(person.team_id, person.uuid, distinct_id, version, sync)
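The util.py change above swaps person.team.id for person.team_id. In Django, accessing the related team object can trigger an extra database query when it isn't cached, whereas team_id reads the foreign-key column already stored on the person row. A toy analogue (plain Python, not Django) of that difference:

```python
# Toy analogue of a Django foreign key: .team_id reads the locally stored
# column, while the .team property simulates the extra round-trip that
# `person.team.id` would cost when the related object isn't cached.
class FakePerson:
    def __init__(self, team_id):
        self.team_id = team_id
        self.queries = 0  # counts simulated DB round-trips

    @property
    def team(self):
        self.queries += 1  # stands in for fetching the related Team row
        return {"id": self.team_id}
```

In a loop over up to 100 persons, using team_id keeps the bulk deletion from issuing one extra Team lookup per person.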


def _delete_person(