perf: Use GLOBAL IN
for session ID subselect from events in recordings list
#26889
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
Since the
session_replay_events
table is a distributed table, queries against that table result inN
queries being executed whereN
is the number of shards in the cluster. When theWHERE
clause of that query includes a subquery (like on the right side of anIN
), that subquery is too independently evaluated on allN
shards.When the table selected from in that subquery is also a distributed table (e.g.
events
), the subquery — which was already repeatedN
times over — requires issuing anotherN
queries (soN^2
total) to evaluate the distributed query. This results in the same subquery being issued many times over on the cluster: each shard will end up evaluating their share of the subqueryN
times due to requests from other shards. If this query is an expensive one (includes a join, callsJSONExtract
functions, etc), this can be particularly problematic as it can quickly consume a significant amount of cluster resources.This behavior can be seen here: https://metabase.prod-eu.posthog.dev/question/304-look-up-query-by-query-id-coordinator-and-data-nodes?query_id=1589_None_xe41tCfA&include_query_start=Yes
More background context here: https://posthog.slack.com/archives/C076E99B152/p1734023417743289
Changes
This changes the
events
subquery used in this query to useGLOBAL IN
instead ofIN
. This causes theevents
subquery to be evaluated first by the initiator, instead of being left to handle by the shards. That result set is then sent to the shards along with the distributed query, avoiding the need for each shard to evaluate the subqueries independently and avoiding much of the redundant work.The behavior with
GLOBAL JOIN
can be seen here: https://metabase.prod-eu.posthog.dev/question/304-look-up-query-by-query-id-coordinator-and-data-nodes?query_id=396f4947-26e3-4cb0-bb56-c533f3f59aee&include_query_start=Yes - Note that thedistributed_depth
is one level shallower here, in addition to the total number of queries being executed being significantly fewer (in addition to the query duration, rows read, etc being much better.)Does this work well for both Cloud and self-hosted?
Yes?
How did you test this code?
Updated snapshots, checked that snapshots are consistent with ad hoc query modifications tested on production.