feat(hogql): use join for "in cohort" operations instead of subquery #17354

Merged: mariusandra merged 10 commits into master from hogql-in-cohort-join on Oct 18, 2023


mariusandra
Collaborator

@mariusandra mariusandra commented Sep 7, 2023

Problem

A user reported that their retention query, which filters by cohort, times out. It looks like we check cohort membership with inefficient subqueries such as:

AND pdi.person_id IN (
  SELECT DISTINCT person_id FROM cohortpeople WHERE team_id = $1 AND cohort_id = $2 AND version = $3
)

If we instead use a join such as

INNER JOIN 
(
  SELECT DISTINCT person_id FROM cohortpeople WHERE team_id = $1 AND cohort_id = $2 AND version = $3
) cohortperson ON cohortperson.person_id = person.id

the query will complete as fast as a sonic hedgehog.
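To make the equivalence concrete, here is a minimal sketch that runs both shapes against an in-memory SQLite database and checks that they select the same people. SQLite is only a stand-in for the real database, and the table contents and parameter values are made up for illustration; it says nothing about relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE cohortpeople (
        person_id INTEGER, team_id INTEGER, cohort_id INTEGER, version INTEGER
    );
    INSERT INTO person (id) VALUES (1), (2), (3), (4);
    -- Person 1 is listed twice: without DISTINCT the join would duplicate them.
    INSERT INTO cohortpeople VALUES
        (1, 1, 2, 3), (1, 1, 2, 3), (3, 1, 2, 3), (4, 1, 9, 3);
""")
params = (1, 2, 3)  # hypothetical team_id, cohort_id, version

# Membership check via an IN subquery.
via_subquery = conn.execute("""
    SELECT person.id FROM person
    WHERE person.id IN (
        SELECT DISTINCT person_id FROM cohortpeople
        WHERE team_id = ? AND cohort_id = ? AND version = ?
    )
    ORDER BY person.id
""", params).fetchall()

# Same membership check via an INNER JOIN on the same subquery.
via_join = conn.execute("""
    SELECT person.id FROM person
    INNER JOIN (
        SELECT DISTINCT person_id FROM cohortpeople
        WHERE team_id = ? AND cohort_id = ? AND version = ?
    ) cohortperson ON cohortperson.person_id = person.id
    ORDER BY person.id
""", params).fetchall()

print(via_subquery, via_join)
```

Both queries return the same two people here; the duplicate cohort row for person 1 shows why the DISTINCT inside the joined subquery matters.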

Changes

This will not fix the problematic retention query, as it's written in the old style. However, we'll have it ported soon, and this will make a difference in other HogQL queries.

This PR adds a HogQL modifier to select cohorts either via a join or a subquery.

When selecting via join, the generated HogQL code looks like:

LEFT JOIN (SELECT person_id, 1 as matched FROM static_cohort_people WHERE cohort_id = 2) 
AS in_cohort__2
ON events.person_id = in_cohort__2.person_id 

The IN COHORT comparison is then swapped out with in_cohort__2.matched = 1.
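As a rough illustration of that swap, the sketch below runs an IN-subquery filter and the LEFT JOIN plus `matched = 1` form side by side. It uses SQLite as a stand-in with invented rows, and the queries are hand-written plain SQL rather than actual HogQL output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event TEXT, person_id INTEGER);
    CREATE TABLE static_cohort_people (person_id INTEGER, cohort_id INTEGER);
    INSERT INTO events VALUES ('a', 1), ('b', 2), ('c', 3), ('d', 1);
    INSERT INTO static_cohort_people VALUES (1, 2), (3, 2), (2, 7);
""")

# Subquery form of "person_id in cohort 2".
original = conn.execute("""
    SELECT event FROM events
    WHERE person_id IN (
        SELECT person_id FROM static_cohort_people WHERE cohort_id = 2
    )
    ORDER BY event
""").fetchall()

# Join form: every event row is kept by the LEFT JOIN, and only rows
# whose person matched the cohort pass the "matched = 1" comparison.
rewritten = conn.execute("""
    SELECT event FROM events
    LEFT JOIN (
        SELECT person_id, 1 AS matched
        FROM static_cohort_people WHERE cohort_id = 2
    ) AS in_cohort__2
    ON events.person_id = in_cohort__2.person_id
    WHERE in_cohort__2.matched = 1
    ORDER BY event
""").fetchall()

print(original, rewritten)
```

In this toy data, the event from person 2 (who is only in cohort 7) is filtered out by both forms.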

I chose a LEFT JOIN instead of an INNER JOIN because we don't know how deeply nested the where ... in cohort ... comparison is. It could be where person_id in cohort 2, or it could be where true or person_id in cohort 2. An inner join would have automatically removed all non-matching rows in the second example, even though cohort membership shouldn't affect the result given that where statement.
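The difference can be seen in a small sketch, again with SQLite standing in for the real database and invented rows. `count_events` is a hypothetical helper, and `1 = 1 OR ...` plays the role of the where true or person_id in cohort 2 case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event TEXT, person_id INTEGER);
    CREATE TABLE static_cohort_people (person_id INTEGER, cohort_id INTEGER);
    INSERT INTO events VALUES ('a', 1), ('b', 2), ('c', 3), ('d', 1);
    INSERT INTO static_cohort_people VALUES (1, 2), (3, 2);
""")

COHORT_SUBQUERY = (
    "(SELECT person_id, 1 AS matched FROM static_cohort_people "
    "WHERE cohort_id = 2) AS in_cohort__2"
)


def count_events(join_type: str) -> int:
    """Count events under a where clause that is always true."""
    sql = f"""
        SELECT count(*) FROM events
        {join_type} JOIN {COHORT_SUBQUERY}
        ON events.person_id = in_cohort__2.person_id
        WHERE 1 = 1 OR in_cohort__2.matched = 1
    """
    return conn.execute(sql).fetchone()[0]


left_count = count_events("LEFT")    # keeps every event row
inner_count = count_events("INNER")  # drops person 2's event before WHERE runs

print(left_count, inner_count)
```

With the LEFT JOIN all four events survive, while the INNER JOIN silently loses the event from the person outside the cohort, even though the where clause would have accepted it.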

Performance

Locally, with limited data, the new query is sometimes faster:

[image]

However, globally, the older one is faster for our team:

[Screenshot 2023-10-17 at 16 17 50]

I believe the results will be reversed when looking at a big user. To be on the safe side, I did not enable this query modification for anyone.

Next steps: testing the new and old queries to see which is faster when... and then enabling the new one for those cases.

How did you test this code?

Wrote tests

@posthog-bot
Contributor

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@posthog-bot
Contributor

This PR was closed due to lack of activity. Feel free to reopen if it's still relevant.

@mariusandra mariusandra marked this pull request as ready for review October 17, 2023 14:40
@mariusandra mariusandra merged commit f930132 into master Oct 18, 2023
73 checks passed
@mariusandra mariusandra deleted the hogql-in-cohort-join branch October 18, 2023 10:21