fix(persons): limit to maximum of 2500 distinct_ids for cross-db join #18414
Conversation
Force-pushed from 561f090 to d89468e
posthog/models/person/person.py (Outdated)

```diff
@@ -34,7 +36,7 @@ def distinct_ids(self) -> List[str]:
             id[0]
             for id in PersonDistinctId.objects.filter(person=self, team_id=self.team_id)
             .order_by("id")
-            .values_list("distinct_id")
+            .values_list("distinct_id")[:MAX_LIMIT_DISTINCT_IDS]
```
So... order by is ascending by default and id is numeric...
So this is the oldest 2500 IDs.
So, what happens if I am interested in ID 2501?
E.g. "load me all events for person X" now becomes "load me all events for the first 2500 ids attached to person X".
I can't think of a good solution for this without person-on-events in place. And since the distinct_id-to-person mapping isn't in ClickHouse (I assume), you can't easily join.
But this affects everyone querying the Person model, even if they want all of the distinct ids.
(it is in ClickHouse too, the person_distinct_id2 table) - that's how funnels aggregate people.
Yeah, I don't like this solution too much. Querying for 10,000+ distinct ids isn't really the issue as far as I can tell; it's when you pass them in somewhere that bad things happen (like into another query, or returning them via the API).
This also leads to an issue of not being able to trust person.distinct_ids to return the right things.
Sooo.. not sure what's best here, but if this is just for the events view for a given person, I'd scope the change to the event_query_runner. And if this query runs only on a given person's page, maybe the join is OK to do, since it doesn't affect the default event query, which has no person ID filter? It wouldn't be as fast, definitely, but correctness seems more important here, since events can be 'orphaned' otherwise. Or, programmatically switch between the optimal & non-optimal version depending on the number of distinct ids: fewer than 2000, go for the fast version, since all ids will be included; more, do the join. Best of both worlds - fast for all regular users, slow only for users doing questionable things, but still correct :P
How many users like this exist? Maybe the programmatic switch is overkill and just choosing the first 2500 is good enough.
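If we did go the switch route, a sketch of what it could look like (the threshold, counting step, and helper names are illustrative, not from the PR):

```python
SWITCH_THRESHOLD = 2500  # illustrative; could also be the 2000 mentioned above

def events_query_for_person(person):
    # Count the mapping rows cheaply first, then pick a strategy.
    num_ids = PersonDistinctId.objects.filter(
        person=person, team_id=person.team_id
    ).count()
    if num_ids <= SWITCH_THRESHOLD:
        # Fast path: inline the full distinct_id list into the events query.
        return build_query_with_inlined_ids(person.distinct_ids)  # hypothetical helper
    # Correct-but-slower path: join against the person mapping instead.
    return build_query_with_person_join(person.pk)  # hypothetical helper
```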
Thanks for the thoughtful responses! It's super helpful that you're keeping an eye out & adding context. The comments totally make sense and I adjusted the PR to apply the change only to the events list queries. I'm thinking it's okay to make the change here since we'd get a 500 for many distinct_ids otherwise. Let me know if you think otherwise.
> So... order by is ascending by default and id is numeric... So this is the oldest 2500 IDs
I initially thought this is what we want here, since many distinct_ids are usually caused by instrumentation issues, and the distinct_ids actually belonging to a person should be among the first ones (later ones not belonging to the person, or being completely random). But as the query is also used for the person events page, we want the last ids for events as well. Therefore I'm now unioning a mix of both.
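Roughly, that union could look like the sketch below (not the PR's exact code; the even first/last split is an assumption):

```python
from typing import List

def limited_distinct_ids(person, limit: int = 2500) -> List[str]:
    # Sketch: spend half the budget on the oldest ids (likely the "real"
    # ones) and half on the newest (needed for recent events), then
    # de-duplicate while preserving order.
    base = PersonDistinctId.objects.filter(person=person, team_id=person.team_id)
    half = limit // 2
    oldest = base.order_by("id").values_list("distinct_id", flat=True)[:half]
    newest = base.order_by("-id").values_list("distinct_id", flat=True)[:half]
    return list(dict.fromkeys([*oldest, *newest]))
```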
> (it is in ClickHouse too, the person_distinct_id2 table) - that's how funnels aggregate people.
I've been told we don't use a CH join here, since it's much slower in this case. A programmatic switch sounds like a good idea, and I heard there are other ideas for mitigating the problem in the future. However, I don't want to spend too much time on this right now and just want to get an intermediate fix in, so that we don't 500 for the persons query.
I also heard we want to get rid of the table in CH, and I'm wondering whether we could use the CH PostgreSQL table engine to make join queries instead. A quick POC seems promising.
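For reference, a join through ClickHouse's `postgresql()` table function could look roughly like this (a sketch of the idea only; host, credentials, and table/column names below are placeholders, not the actual POC):

```python
# Sketch: join events against the Postgres distinct_id mapping straight
# from ClickHouse via the postgresql() table function. All connection
# details and identifiers are placeholders.
POC_EVENTS_FOR_PERSON = """
SELECT e.uuid, e.event, e.timestamp
FROM events AS e
INNER JOIN postgresql(
    'postgres-host:5432', 'posthog', 'posthog_persondistinctid', 'user', 'password'
) AS pdi ON e.distinct_id = pdi.distinct_id
WHERE e.team_id = %(team_id)s
  AND pdi.person_id = %(person_id)s
"""
```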
> How many users like this exist? Maybe programmatic switch is overkill and just choosing the first 2500 is good enough
So, I did dig a bit into the data, and it seems ~1.5% of teams have persons with more than 2,500 distinct_ids, and ~0.001% of persons are affected.
For a sampling of the data, the curve flattens quickly:

| 500 | 1000 | 2000 | 2500 | 3000 | 3500 | 4000 | 4500 |
|---|---|---|---|---|---|---|---|
| 15835 | 8726 | 4143 | 3123 | 2403 | 1934 | 1538 | 1252 |
ace 👍
> However I don't want to spend too much time on this right now and just get an intermediate fix in, so that we don't 500 for the persons query.
🤘
Force-pushed from 6343577 to deee8bc
I'd worry we're swapping one subtle bug for another... but it also seems like you all are aware this could be better.
Totally on board with getting over the initial bug, and we've got the context captured in the comment and this PR 🥇
Problem
See #18253.
Changes
This adds a maximum limit for distinct_ids when they are used in a cross-db subquery. ClickHouse queries are limited to 256 KB in size, so we could fit around 5k distinct_ids (assuming an average size of 46 bytes). This PR limits the distinct_ids to 2500.
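The size budget, roughly (the 46-byte average is the assumption above; 256 KiB matches ClickHouse's default max_query_size):

```python
MAX_QUERY_BYTES = 256 * 1024   # ClickHouse's default max_query_size (256 KiB)
AVG_DISTINCT_ID_BYTES = 46     # assumed average distinct_id size, per above

# ~5.7k ids would fit in theory; capping at 2500 leaves headroom
# for the rest of the query text.
max_ids = MAX_QUERY_BYTES // AVG_DISTINCT_ID_BYTES  # 5698
```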
How did you test this code?
CI run & Django shell