perf: Use `GLOBAL IN` for session ID subselect from events in recordings list #26889

tkaemming · 2024-12-13T01:51:09Z

Problem

Since the session_replay_events table is a distributed table, queries against that table result in N queries being executed, where N is the number of shards in the cluster. When the WHERE clause of that query includes a subquery (like on the right side of an IN), that subquery is also evaluated independently on all N shards.

When the table selected from in that subquery is also a distributed table (e.g. events), the subquery, which was already repeated N times over, requires issuing another N queries each time (so N^2 total) to evaluate the distributed query. This results in the same subquery being issued many times over on the cluster: each shard ends up evaluating its share of the subquery N times due to requests from other shards. If the subquery is an expensive one (includes a join, calls JSONExtract functions, etc.), this can be particularly problematic, as it can quickly consume a significant amount of cluster resources.
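To make the fan-out arithmetic concrete, here is a minimal counting sketch. It is a simplified model rather than anything taken from ClickHouse itself: the function name is hypothetical, and it assumes every shard re-issues the distributed subquery to every shard.

```python
def subquery_executions_with_plain_in(num_shards: int) -> int:
    """Count shard-local evaluations of the subquery under plain IN (toy model)."""
    executions = 0
    # The outer distributed query is sent to every shard.
    for _outer_shard in range(num_shards):
        # Each shard evaluates the subquery itself. Because the subquery also
        # reads a distributed table, it fans out to every shard again.
        for _inner_shard in range(num_shards):
            executions += 1
    return executions


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"{n} shards -> {subquery_executions_with_plain_in(n)} subquery executions")
```

In this model the cost of the subquery grows quadratically with cluster size, so the same expensive subquery gets more wasteful as shards are added.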

This behavior can be seen here: https://metabase.prod-eu.posthog.dev/question/304-look-up-query-by-query-id-coordinator-and-data-nodes?query_id=1589_None_xe41tCfA&include_query_start=Yes

More background context here: https://posthog.slack.com/archives/C076E99B152/p1734023417743289

Changes

This changes the events subquery used in this query to use GLOBAL IN instead of IN. This causes the events subquery to be evaluated first by the initiator, instead of being left for the shards to handle. That result set is then sent to the shards along with the distributed query, so each shard no longer needs to evaluate the subquery independently and much of the redundant work is avoided.
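The difference between the two strategies can be sketched with a toy in-memory "cluster". This is only an illustration under stated assumptions: the shard contents, session IDs, and function names below are hypothetical, and real ClickHouse ships the GLOBAL IN result set to the shards as a temporary table rather than a Python set.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: shard number -> session IDs stored on that shard.
EVENTS_BY_SHARD: Dict[int, List[str]] = {0: ["s1"], 1: ["s2"], 2: ["s1", "s3"]}
REPLAY_BY_SHARD: Dict[int, List[str]] = {0: ["s1", "s4"], 1: ["s3"], 2: ["s2", "s5"]}


def events_subquery(counter: List[int]) -> Set[str]:
    """Evaluate the subquery over the distributed events data: one local run per shard."""
    session_ids: Set[str] = set()
    for rows in EVENTS_BY_SHARD.values():
        counter[0] += 1  # one shard-local evaluation of the subquery
        session_ids.update(rows)
    return session_ids


def recordings_with_in() -> Tuple[List[str], int]:
    """Plain IN: every shard evaluates the distributed subquery on its own."""
    counter = [0]
    matched: List[str] = []
    for rows in REPLAY_BY_SHARD.values():
        ids = events_subquery(counter)  # repeated once per shard
        matched.extend(r for r in rows if r in ids)
    return sorted(matched), counter[0]


def recordings_with_global_in() -> Tuple[List[str], int]:
    """GLOBAL IN: the initiator evaluates the subquery once and ships the result."""
    counter = [0]
    ids = events_subquery(counter)  # evaluated a single time by the initiator
    matched: List[str] = []
    for rows in REPLAY_BY_SHARD.values():
        matched.extend(r for r in rows if r in ids)
    return sorted(matched), counter[0]


if __name__ == "__main__":
    print(recordings_with_in())
    print(recordings_with_global_in())
```

Both functions return the same matching session IDs; only the number of shard-local subquery evaluations differs (N^2 versus N in this model, with N = 3).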

The behavior with GLOBAL IN can be seen here: https://metabase.prod-eu.posthog.dev/question/304-look-up-query-by-query-id-coordinator-and-data-nodes?query_id=396f4947-26e3-4cb0-bb56-c533f3f59aee&include_query_start=Yes - Note that the distributed_depth is one level shallower here, the total number of queries executed is significantly lower, and the query duration, rows read, etc. are much better.

Does this work well for both Cloud and self-hosted?

Yes?

How did you test this code?

Updated snapshots and checked that they are consistent with the ad hoc query modifications tested on production.

pauldambra

ratio of work to size of change is way off on this PR 🤯 🧠 😍

absolutely wild!

tkaemming and others added 3 commits December 12, 2024 17:49

perf: Use GLOBAL IN for session ID subselect in recordings list

5a59e9e

Update query snapshots

ec1a17c

Update query snapshots

def5d49

tkaemming marked this pull request as ready for review December 13, 2024 03:25

tkaemming requested review from pauldambra and a team December 13, 2024 03:25

pauldambra approved these changes Dec 13, 2024

pauldambra and others added 4 commits December 13, 2024 08:42

and the other events subquery

302b255

Update query snapshots

e13d9f4

Update query snapshots

73bec8a

Merge branch 'master' into sessions-global-in

cbac2ed

pauldambra enabled auto-merge (squash) December 13, 2024 11:52

pauldambra merged commit f8c5303 into master Dec 13, 2024
89 checks passed

pauldambra deleted the sessions-global-in branch December 13, 2024 12:01
