fix(persons): limit to maximum of 2500 distinct_ids for cross-db join #18414

Merged
10 commits merged on Nov 8, 2023
6 changes: 6 additions & 0 deletions posthog/models/filters/test/__snapshots__/test_filter.ambr
@@ -304,6 +304,7 @@
WHERE ("posthog_persondistinctid"."person_id" = 2
AND "posthog_persondistinctid"."team_id" = 2)
ORDER BY "posthog_persondistinctid"."id" ASC
LIMIT 2500
'
---
# name: TestDjangoPropertyGroupToQ.test_property_group_to_q_with_cohorts
@@ -367,6 +368,7 @@
WHERE ("posthog_persondistinctid"."person_id" = 2
AND "posthog_persondistinctid"."team_id" = 2)
ORDER BY "posthog_persondistinctid"."id" ASC
LIMIT 2500
'
---
# name: TestDjangoPropertyGroupToQ.test_property_group_to_q_with_cohorts.3
@@ -376,6 +378,7 @@
WHERE ("posthog_persondistinctid"."person_id" = 2
AND "posthog_persondistinctid"."team_id" = 2)
ORDER BY "posthog_persondistinctid"."id" ASC
LIMIT 2500
'
---
# name: TestDjangoPropertyGroupToQ.test_property_group_to_q_with_cohorts_no_match
@@ -439,6 +442,7 @@
WHERE ("posthog_persondistinctid"."person_id" = 2
AND "posthog_persondistinctid"."team_id" = 2)
ORDER BY "posthog_persondistinctid"."id" ASC
LIMIT 2500
'
---
# name: TestDjangoPropertyGroupToQ.test_property_group_to_q_with_negation_cohorts
@@ -571,6 +575,7 @@
WHERE ("posthog_persondistinctid"."person_id" = 2
AND "posthog_persondistinctid"."team_id" = 2)
ORDER BY "posthog_persondistinctid"."id" ASC
LIMIT 2500
'
---
# name: TestDjangoPropertyGroupToQ.test_property_group_to_q_with_negation_cohorts.6
@@ -580,5 +585,6 @@
WHERE ("posthog_persondistinctid"."person_id" = 2
AND "posthog_persondistinctid"."team_id" = 2)
ORDER BY "posthog_persondistinctid"."id" ASC
LIMIT 2500
'
---
4 changes: 3 additions & 1 deletion posthog/models/person/person.py
@@ -5,6 +5,8 @@

from posthog.models.utils import UUIDT

MAX_LIMIT_DISTINCT_IDS = 2500


class PersonManager(models.Manager):
def create(self, *args: Any, **kwargs: Any):
@@ -34,7 +36,7 @@ def distinct_ids(self) -> List[str]:
id[0]
for id in PersonDistinctId.objects.filter(person=self, team_id=self.team_id)
.order_by("id")
.values_list("distinct_id")
.values_list("distinct_id")[:MAX_LIMIT_DISTINCT_IDS]
Member:

So... order by is ascending by default and id is numeric...

So this is the oldest 2500 IDs

So, what happens if I am interested in ID 2501?

E.g. "load me all events for person X" now becomes "load me all events for the first 2500 ids attached to person X"

I can't think of a good solution for this without person-on-events in place. And since the distinct id to person mapping isn't in ClickHouse (I assume), you can't easily join.

But this affects everyone querying the Person model even if they want all of the distinct ids.
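
To make that concrete, here is a hypothetical caller; the helper name `query_events_for_distinct_ids` and the surrounding query shape are purely illustrative, not code from this PR:

```python
# Hypothetical illustration of the truncation concern described above; any
# downstream code that expands a person filter via `person.distinct_ids`
# behaves the same way.
from posthog.models import Person

person = Person.objects.get(pk=42)  # example primary key

# After this change the property orders by `id` ascending and slices, so it
# returns at most the 2500 *oldest* distinct_ids.
ids = person.distinct_ids

# A query built from `ids` silently excludes distinct_id #2501 onwards, i.e.
# "all events for person X" becomes "events for the first 2500 ids".
events = query_events_for_distinct_ids(team_id=person.team_id, distinct_ids=ids)  # hypothetical helper
```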

Collaborator:

(it is in ClickHouse too, the person_distinct_id2 table) - that's how funnels aggregate people.

Yeah, I don't like this solution too much. Querying for 10,000+ distinct ids isn't really the issue as far as I can tell; it's when you pass them in somewhere that bad things happen (like into another query, or returning them via the API).

This also leads to the issue of not being able to trust person.distinct_ids to return the right things.

Sooo.. not sure what's best here, but if this is just for the events view for a given person, I'd scope the change to the event_query_runner. And if this query runs only on a given person's page, maybe the join is okay to do, since it doesn't affect the default event query, which has no person ID filter? It definitely wouldn't be as fast, but correctness seems more important here, since events can be 'orphaned' otherwise. Or, programmatically switch between the optimal & non-optimal version depending on the number of distinct ids: fewer than 2000, go for the fast version, since all ids will be included; more than that, do the join. Best of both worlds - fast for all regular users, slow only for users doing questionable things, but still correct :P

How many users like this exist? Maybe a programmatic switch is overkill and just choosing the first 2500 is good enough.
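
For what it's worth, a minimal sketch of what that programmatic switch could look like; the threshold, function name, and fallback signalling are made up for illustration, not code from the PR:

```python
# Sketch of the "programmatic switch" idea: count the ids first, then pick
# either the fast slice or the slower-but-correct join path.
from typing import List, Optional

from posthog.models.person.person import PersonDistinctId

FAST_PATH_THRESHOLD = 2000  # hypothetical cut-off, per the comment above


def distinct_ids_for_events_query(person) -> Optional[List[str]]:
    qs = PersonDistinctId.objects.filter(person=person, team_id=person.team_id)
    if qs.count() <= FAST_PATH_THRESHOLD:
        # Every id fits comfortably: pass them straight into the events query.
        return list(qs.order_by("id").values_list("distinct_id", flat=True))
    # Too many ids: return None to tell the caller to fall back to the
    # join-based query (e.g. joining on person_distinct_id2 in ClickHouse).
    return None
```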

Contributor (author):

Thanks for the thoughtful responses! It's super helpful that you're keeping an eye out & adding context. The comments totally make sense and I adjusted the PR to apply the change only to the events list queries. I'm thinking it's okay to make the change here since we'd get a 500 for many distinct_ids otherwise. Let me know if you think otherwise.

So... order by is ascending by default and id is numeric...

So this is the oldest 2500 IDs

I initially thought this is what we want here, since a very large number of distinct_ids is usually caused by instrumentation issues, and the distinct_ids that genuinely belong to a person should be among the first ones (later ones often don't belong to the person or are completely random). But since the query is also used for the person's events page, we want the last ids as well. Therefore I'm now unioning a mix of both.
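
Roughly, the "mix of both" looks like this; a sketch of the approach rather than the exact code in the PR, reusing the MAX_LIMIT_DISTINCT_IDS constant added in this diff:

```python
# Sketch: take the oldest half and the newest half of the cap, then union
# them so both the original ids and the most recent ids are represented.
# `person` is assumed to be an existing Person instance.
from posthog.models.person.person import MAX_LIMIT_DISTINCT_IDS, PersonDistinctId

half = MAX_LIMIT_DISTINCT_IDS // 2
base_qs = PersonDistinctId.objects.filter(person=person, team_id=person.team_id)

oldest = list(base_qs.order_by("id").values_list("distinct_id", flat=True)[:half])
newest = list(base_qs.order_by("-id").values_list("distinct_id", flat=True)[:half])

# De-duplicate while preserving order; the two halves only overlap for
# persons with fewer than MAX_LIMIT_DISTINCT_IDS ids.
distinct_ids = list(dict.fromkeys(oldest + newest))
```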

(it is in ClickHouse too, the person_distinct_id2 table) - that's how funnels aggregate people.

I've been told we don't use a CH join here, since it's much slower in this case. A programmatic switch sounds like a good idea, and I heard there are other ideas for mitigating the problem in the future. However I don't want to spend too much time on this right now and just get an intermediate fix in, so that we don't 500 for the persons query.

I also heard we want to get rid of the table in CH, and I'm wondering whether we could use the CH PostgreSQL table engine to make join queries instead. A quick POC seems promising.
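
For reference, a rough sketch of what such a POC could look like with the ClickHouse PostgreSQL table engine; the connection details, column types, table names, and the clickhouse-driver client are placeholders and assumptions, not how PostHog actually wires this up:

```python
# Sketch: expose the Postgres distinct_id mapping to ClickHouse via the
# PostgreSQL table engine, so events can be joined without capping the ids.
from clickhouse_driver import Client  # assumption: any ClickHouse client would do

ch = Client(host="localhost")

ch.execute(
    """
    CREATE TABLE IF NOT EXISTS pg_person_distinct_id
    (
        distinct_id String,
        person_id Int64,
        team_id Int64
    )
    ENGINE = PostgreSQL('postgres-host:5432', 'posthog', 'posthog_persondistinctid', 'user', 'password')
    """
)

# Join at query time instead of materialising a (possibly huge) id list:
rows = ch.execute(
    """
    SELECT e.uuid, e.event, e.timestamp
    FROM events AS e
    INNER JOIN pg_person_distinct_id AS pdi ON pdi.distinct_id = e.distinct_id
    WHERE e.team_id = %(team_id)s AND pdi.person_id = %(person_id)s
    """,
    {"team_id": 1, "person_id": 42},
)
```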

How many users like this exist? Maybe a programmatic switch is overkill and just choosing the first 2500 is good enough

So, I did dig a bit into the data, and it seems ~1.5% of teams have persons with more than 2,500 distinct_ids, and ~0.001% of persons are affected.

For a sample of the data, the curve flattens quickly:

distinct_ids:   500    1000   2000   2500   3000   3500   4000   4500
count:        15835    8726   4143   3123   2403   1934   1538   1252

Collaborator:

ace 👍

Member:

However I don't want to spend too much time on this right now and just get an intermediate fix in, so that we don't 500 for the persons query.

🤘

]

# :DEPRECATED: This should happen through the plugin server