perf: Use GLOBAL IN for session ID subselect from events in recordings list #26889

Merged: 7 commits, Dec 13, 2024


@tkaemming commented Dec 13, 2024

Problem

Since the session_replay_events table is a distributed table, a query against it results in N queries being executed, where N is the number of shards in the cluster. When the WHERE clause of that query includes a subquery (such as on the right side of an IN), that subquery is also evaluated independently on all N shards.

When the table selected from in that subquery is also a distributed table (e.g. events), the subquery, which was already repeated N times over, requires issuing another N queries each time (so N^2 total) to evaluate the distributed query. This results in the same subquery being issued many times over on the cluster: each shard ends up evaluating its share of the subquery N times due to requests from other shards. If this query is an expensive one (includes a join, calls JSONExtract functions, etc.), this can be particularly problematic, as it can quickly consume a significant amount of cluster resources.
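The fan-out described above can be sketched as a toy model. This is only an illustration of the counting argument, not a measurement of ClickHouse itself; the function name `simulate_plain_in` and the shard count of 4 are assumptions made for the example.

```python
from collections import Counter


def simulate_plain_in(num_shards: int) -> Counter:
    """Count how often each shard runs its local part of the IN subquery.

    Toy model of the behavior described in this PR: the outer query is
    sent to every shard, and every shard then fans the distributed
    subquery out to every shard again.
    """
    executions = Counter()
    # The initiator sends the outer query to each shard.
    for _outer_shard in range(num_shards):
        # That shard evaluates the subquery on its own, and because the
        # subquery reads a distributed table, it asks every shard for its part.
        for target_shard in range(num_shards):
            executions[target_shard] += 1
    return executions


runs = simulate_plain_in(4)
print(sum(runs.values()))  # 16 local subquery executions in total (N^2)
print(runs[0])             # 4: each shard runs its share N times
```

With 4 shards, the model yields 16 local subquery executions, 4 on each shard, which matches the N^2 figure above.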

This behavior can be seen here: https://metabase.prod-eu.posthog.dev/question/304-look-up-query-by-query-id-coordinator-and-data-nodes?query_id=1589_None_xe41tCfA&include_query_start=Yes

More background context here: https://posthog.slack.com/archives/C076E99B152/p1734023417743289

Changes

This changes the events subquery used in this query to use GLOBAL IN instead of IN. This causes the events subquery to be evaluated first by the initiator, instead of being left for the shards to handle. That result set is then sent to the shards along with the distributed query, so the shards no longer need to evaluate the subquery independently, which avoids much of the redundant work.
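A rough sketch of the change, under stated assumptions: `render_filter` shows only the general shape of the query (the `session_id` column and the placeholder comment are hypothetical, not the actual query text from this PR), and `simulate_global_in` is a toy count of subquery executions when the initiator evaluates the subquery once.

```python
from collections import Counter


def render_filter(use_global_in: bool) -> str:
    """Render an illustrative query; column names are hypothetical."""
    operator = "GLOBAL IN" if use_global_in else "IN"
    return (
        "SELECT session_id FROM session_replay_events "
        f"WHERE session_id {operator} "
        "(SELECT session_id FROM events /* event filters */)"
    )


def simulate_global_in(num_shards: int) -> Counter:
    """Count how often each shard runs its local part of the subquery."""
    executions = Counter()
    # The initiator evaluates the distributed subquery once, which means
    # one local execution on each shard.
    for target_shard in range(num_shards):
        executions[target_shard] += 1
    # The outer query is then sent to every shard together with the
    # subquery's result set, so no shard re-runs the subquery.
    return executions


print(render_filter(use_global_in=True))
print(sum(simulate_global_in(4).values()))  # 4 local subquery executions (N)
```

In this model the subquery work grows with N rather than N^2, at the cost of shipping the result set from the initiator to every shard.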

The behavior with GLOBAL IN can be seen here: https://metabase.prod-eu.posthog.dev/question/304-look-up-query-by-query-id-coordinator-and-data-nodes?query_id=396f4947-26e3-4cb0-bb56-c533f3f59aee&include_query_start=Yes - Note that the distributed_depth is one level shallower here, and that significantly fewer queries are executed in total. Query duration, rows read, etc. are also much better.

Does this work well for both Cloud and self-hosted?

Yes?

How did you test this code?

Updated snapshots, checked that snapshots are consistent with ad hoc query modifications tested on production.

@tkaemming marked this pull request as ready for review December 13, 2024 03:25
@tkaemming requested review from pauldambra and a team December 13, 2024 03:25

@pauldambra left a comment

ratio of work to size of change is way off on this PR 🤯 🧠 😍

absolutely wild!

@pauldambra enabled auto-merge (squash) December 13, 2024 11:52
@pauldambra merged commit f8c5303 into master Dec 13, 2024
89 checks passed
@pauldambra deleted the sessions-global-in branch December 13, 2024 12:01