fix(persons): limit to maximum of 2500 distinct_ids for cross-db join #18414
Merged (+45 −5)
Commits (10):
- d89468e fix(persons): limit to maximum of 2500 distinct_ids for cross-db join (thmsobrmlr)
- 0eec9e9 Update query snapshots (github-actions[bot])
- deee8bc scope max distinct ids change to cross-db queries (thmsobrmlr)
- 4972299 Merge branch 'master' into max-limit-distinct-ids (thmsobrmlr)
- 0c55adc whitespace (thmsobrmlr)
- 777336f use first and last distinct_ids (thmsobrmlr)
- 9dfd662 Update query snapshots (github-actions[bot])
- 8d9956c Update query snapshots (github-actions[bot])
- 158aa64 types (thmsobrmlr)
- 79e2981 Update query snapshots (github-actions[bot])
Conversations
So... order by is ascending by default and id is numeric, so this is the oldest 2500 IDs.
So what happens if I am interested in ID 2501? E.g. "load me all events for person X" now becomes "load me all events for the first 2500 ids attached to person X".
I can't think of a good solution for this without person-on-events in place. And since the distinct id to person mapping isn't in ClickHouse (I assume), you can't easily join.
But this affects everyone querying the Person model, even if they want all of the distinct ids.
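(A minimal illustration of the truncation being described here, with plain Python standing in for the SQL; the function and variable names are made up for this sketch and are not the PR's code.)

```python
# ORDER BY id ascending + LIMIT keeps only the *oldest* rows, so any
# distinct_id attached to the person later than the cap becomes invisible.

def limited_distinct_ids(rows: list[tuple[int, str]], limit: int = 2500) -> list[str]:
    """rows are (id, distinct_id) pairs from the mapping table."""
    # sorted() orders by id first, mimicking ORDER BY id; [:limit] mimics LIMIT.
    return [distinct_id for _, distinct_id in sorted(rows)[:limit]]

# distinct_id number 2501 never makes it into the events filter:
rows = [(i, f"anon-{i}") for i in range(1, 2502)]
assert "anon-2501" not in limited_distinct_ids(rows)
```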
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It is in ClickHouse too, the person_distinct_id2 table - that's how funnels aggregate people.)
Yeah, I don't like this solution too much. Querying for 10,000+ distinct ids isn't really the issue as far as I can tell; it's when you pass them in somewhere that bad things happen (like in another query, or returning via the API).
This also leads to an issue of not being able to trust person.distinct_ids to return the right things. Sooo... not sure what's best here, but if this is just for the events view for a given person, I'd scope the change to the event_query_runner. And if this query runs only on a given person's page, maybe the join is ok to do, since it wouldn't affect the default event query, which has no personID filter? It wouldn't be as fast, definitely, but correctness seems more important here, since events can be 'orphaned' otherwise.
Or, programmatically switch between the optimal & non-optimal version depending on the number of distinct ids: less than 2000, go for the fast version, since all ids will be included; more than that, do the join. Best of both worlds: fast for all regular users, slow only for users doing questionable things, but still correct :P
How many users like this exist? Maybe a programmatic switch is overkill and just choosing the first 2500 is good enough.
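A sketch of what such a switch could look like; every helper name below is hypothetical, and the inline/join SQL strings are only indicative of the two strategies, not PostHog's actual queries.

```python
# Hypothetical sketch of the "programmatic switch" idea: pick the strategy
# based on how many distinct_ids the person has.
MAX_INLINE_DISTINCT_IDS = 2500

def events_query_with_inline_ids(person_id: int) -> str:
    # Fast path: inline every distinct_id as a literal IN (...) list.
    return f"SELECT * FROM events WHERE distinct_id IN (/* ids of person {person_id} */)"

def events_query_with_join(person_id: int) -> str:
    # Correct-but-slower path: join the mapping table instead of inlining ids.
    return (
        "SELECT e.* FROM events e "
        "JOIN person_distinct_id2 pdi ON e.distinct_id = pdi.distinct_id "
        f"WHERE pdi.person_id = {person_id}"
    )

def build_events_query(person_id: int, n_distinct_ids: int) -> str:
    # Below the cap every id fits, so truncation is impossible: take the fast path.
    if n_distinct_ids <= MAX_INLINE_DISTINCT_IDS:
        return events_query_with_inline_ids(person_id)
    return events_query_with_join(person_id)
```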
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thoughtful responses! It's super helpful that you're keeping an eye out & adding context. The comments totally make sense, and I adjusted the PR to apply the change only to the events list queries. I'm thinking it's okay to make the change here, since we'd get a 500 for many distinct_ids otherwise. Let me know if you think otherwise.
I initially thought this is what we want here, since many distinct_ids are usually caused by instrumentation issues, and the distinct_ids belonging to a person should be among the first ones (later ones either not belonging to the person or being completely random). As the query is also used for the person's events page, we want the last ids for events as well. Therefore I'm now unioning a mix of both.
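A rough sketch of that union, assuming the 2500 cap is split evenly between the oldest and newest ids (the PR's actual slicing may differ):

```python
# "First and last distinct_ids": take the oldest half and the newest half of
# the mapping rows and union them. The cap and the even split are assumptions.
MAX_DISTINCT_IDS = 2500

def first_and_last_distinct_ids(rows: list[tuple[int, str]]) -> list[str]:
    """rows are (id, distinct_id) pairs; returns at most MAX_DISTINCT_IDS ids."""
    ordered = sorted(rows)  # ORDER BY id ascending
    if len(ordered) <= MAX_DISTINCT_IDS:
        return [d for _, d in ordered]
    half = MAX_DISTINCT_IDS // 2
    first, last = ordered[:half], ordered[-half:]
    # dict.fromkeys de-duplicates while preserving order, like a SQL UNION
    return list(dict.fromkeys(d for _, d in first + last))
```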
I've been told we don't use a CH join here, since it's much slower in this case. A programmatic switch sounds like a good idea, and I heard there are other ideas for mitigating the problem in the future. However, I don't want to spend too much time on this right now and just want to get an intermediate fix in, so that we don't 500 on the persons query.
I also heard we want to get rid of the table in CH and am wondering whether we could use the CH PostgreSQL table engine to make join queries instead. A quick POC seems promising.
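For reference, a POC along those lines might use ClickHouse's PostgreSQL table engine, which proxies reads through to Postgres at query time. Everything below (host, credentials, table and column names) is a placeholder for illustration, not PostHog's actual setup.

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("localhost")

# Expose the Postgres mapping table inside ClickHouse via the PostgreSQL
# table engine; reads go through to Postgres when the table is queried.
client.execute("""
    CREATE TABLE IF NOT EXISTS pg_person_distinct_id
    (
        person_id UInt64,
        distinct_id String
    )
    ENGINE = PostgreSQL('postgres-host:5432', 'posthog', 'posthog_persondistinctid', 'user', 'password')
""")

# The events query can then join instead of inlining a (truncated) id list.
rows = client.execute("""
    SELECT e.uuid, e.event
    FROM events AS e
    JOIN pg_person_distinct_id AS pdi ON e.distinct_id = pdi.distinct_id
    WHERE pdi.person_id = 42
""")
```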
So, I did dig a bit into the data and it seems ~1.5% of teams have persons with more than 2500 distinct_ids, and ~0.001% of persons are affected.
For a sampling of the data, the curve flattens quickly.
ace 👍
🤘