-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(hogql): use join for "in cohort" operations instead of subquery #17354
Conversation
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the |
This PR was closed due to lack of activity. Feel free to reopen if it's still relevant. |
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the |
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the |
Problem
A user has complained that their retention query that uses cohorts times out. It seems like we're using inefficient subqueries when checking for cohort membership such as
If we instead use a join such as
the query will complete as fast as a sonic hedgehog.
Changes
This will not fix the problematic retention query, as it's written in the old style. However we'll have it ported soon, and this will make a difference in other HogQL queries.
This PR adds a HogQL modifier to select cohorts either via a join or a subquery.
When via join, the generated hogql code looks like:
The
IN COHORT
comparison is then swapped out within_cohort__2.matched = 1
.I chose a LEFT join instead of an
INNER JOIN
because we don't know how deeply nested thewhere ... in cohort ...
comparison is. It could bewhere person_id in cohort 2
or it could bewhere true or person_id in cohort 2
. Doing an inner join would have automatically removed all non-matching rows from the second example, even though it actually shouldn't matter considering the where statement.Performance
Locally with limited data, the new query is sometimes faster
However globally, the older one is faster for our team:
I believe the results will be reversed when looking at a big user. To be on the safe side, I did not enable this query modification for anyone.
Next steps: testing the new and old queries to see which is faster when... and then enabling the new one for those cases.
How did you test this code?
Wrote tests