feat: Support reading from property groups on events #24171
# for this property group, an empty string (the default
# value for the `String` type) is returned. Since that
# is a valid property value, we need to check it here.
materialized_property_sql = f"has({printed_column}, {printed_property_name}) ? {printed_column}[{printed_property_name}] : null"
Note to self for next week: this `has` expression causes the bloom filter indices to be skipped when a property ends up in the `WHERE` clause of a query:
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE (events.team_id = 1) AND ifNull(if(has(events.properties_group_custom, 'file_type'), events.properties_group_custom['file_type'], NULL) = 'audio/vorbis', 0)
LIMIT 0, 101
Query id: d4bf9ebb-5cb0-4f4e-a218-a163621af497
┌─explain────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ Limit (preliminary LIMIT (without OFFSET)) │
│ Aggregating │
│ Expression (Before GROUP BY) │
│ ReadFromMergeTree (default.sharded_events) │
│ Indexes: │
│ MinMax │
│ Condition: true │
│ Parts: 10/10 │
│ Granules: 13/13 │
│ Partition │
│ Condition: true │
│ Parts: 10/10 │
│ Granules: 13/13 │
│ PrimaryKey │
│ Keys: │
│ team_id │
│ Condition: (team_id in [1, 1]) │
│ Parts: 10/10 │
│ Granules: 13/13 │
└────────────────────────────────────────────────────┘
20 rows in set. Elapsed: 0.007 sec.
versus
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE (events.team_id = 1) AND ((events.properties_group_custom['file_type']) = 'audio/vorbis')
LIMIT 0, 101
Query id: 772d0b63-a13a-4f6f-878e-c42f553b608d
┌─explain─────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ Limit (preliminary LIMIT (without OFFSET)) │
│ Aggregating │
│ Expression (Before GROUP BY) │
│ ReadFromMergeTree (default.sharded_events) │
│ Indexes: │
│ MinMax │
│ Condition: true │
│ Parts: 10/10 │
│ Granules: 13/13 │
│ Partition │
│ Condition: true │
│ Parts: 10/10 │
│ Granules: 13/13 │
│ PrimaryKey │
│ Keys: │
│ team_id │
│ Condition: (team_id in [1, 1]) │
│ Parts: 10/10 │
│ Granules: 13/13 │
│ Skip │
│ Name: properties_group_custom_keys_bf │
│ Description: bloom_filter GRANULARITY 1 │
│ Parts: 9/10 │
│ Granules: 12/13 │
│ Skip │
│ Name: properties_group_custom_values_bf │
│ Description: bloom_filter GRANULARITY 1 │
│ Parts: 2/9 │
│ Granules: 2/12 │
└─────────────────────────────────────────────────────┘
30 rows in set. Elapsed: 0.012 sec.
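For reference, a minimal Python sketch of the two expression shapes compared above. The helper names (`escape_string`, `guarded_access`, `direct_access`) are hypothetical and not part of this PR, and the string escaping is deliberately simplified. The guarded form is the one generated in this PR; the direct form is the one that the second `EXPLAIN` shows using the skip indexes.

```python
def escape_string(value: str) -> str:
    """Quote a value as a SQL string literal (simplified, hypothetical helper)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def guarded_access(column: str, key: str) -> str:
    # Shape generated in this PR: a missing key becomes NULL, so it can be
    # told apart from a key whose value is the empty string.
    printed_key = escape_string(key)
    return f"has({column}, {printed_key}) ? {column}[{printed_key}] : null"


def direct_access(column: str, key: str) -> str:
    # Bare map subscript: a missing key yields '' (the String default value).
    return f"{column}[{escape_string(key)}]"


if __name__ == "__main__":
    print(guarded_access("events.properties_group_custom", "file_type"))
    print(direct_access("events.properties_group_custom", "file_type"))
```

The trade-off is that the direct form cannot distinguish a missing key from an empty string value, which is why the guard was added in the first place.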
The draft for optimizing these expressions that impact property group index usage is here: #24381
@mariusandra - I added you here as a reviewer as the "general HogQL guy" but feel free to redirect this to somebody else if you'd like. My confidence in my understanding of the AST is pretty tenuous, so those parts could probably use a bit of extra attention.
@@ -204,6 +204,7 @@ export interface HogQLQueryModifiers {
personsJoinMode?: 'inner' | 'left'
bounceRatePageViewMode?: 'count_pageviews' | 'uniq_urls'
sessionTableVersion?: 'auto' | 'v1' | 'v2'
propertyGroupsMode?: 'enabled' | 'disabled'
Representing a boolean as an enum reads a bit silly here, but there is a good reason for it: this leaves room for a third value that enables property groups with expression rewriting for performance reasons (#24381) without adding yet another modifier or changing the type of this field (changing it could cause problems for any teams that had values set explicitly in `Team.modifiers`).
I left `auto` out of this field (in contrast with some of the above values) since this field is already nullable. I also left this out of `set_default_modifier_values`, since adding new values there invalidates the entire query cache (see b2050df).
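To illustrate the design choice, here is a rough Python sketch of a nullable string enum used in place of a boolean. The `Modifiers` dataclass and `use_property_groups` helper are hypothetical stand-ins, not the actual schema classes, and treating an unset value as disabled is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PropertyGroupsMode(str, Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    # A third member could be added here later without changing the field
    # type, so values already stored for a team would remain valid.


@dataclass
class Modifiers:
    # Hypothetical stand-in for the modifiers object. None means "not set",
    # which is why no separate 'auto' member is needed.
    property_groups_mode: Optional[PropertyGroupsMode] = None


def use_property_groups(modifiers: Modifiers) -> bool:
    # Assumption for this sketch: an unset modifier behaves like 'disabled'.
    return modifiers.property_groups_mode == PropertyGroupsMode.ENABLED
```

With a plain boolean, introducing a third behavior would require either a second modifier or a type change on this field.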
Problem
See #24152 for a thorough description of what problem we're trying to solve here and an overview of what property groups are and why we expect them to improve query performance for queries that reference currently non-materialized properties.
Changes
This enables reading applicable properties from materialized property groups when the `usePropertyGroups` query modifier is provided. (The modifier itself can be enabled for a team via the `Team.modifiers` field added in #22048, so there is no need to add a bespoke feature flag here for testing right now.)
`PropertyGroupManager` for determining what property groups contain a provided property key.
Expected Impact
Property access for parts that have had the property group materialized should be significantly faster, since those queries will now read less data, avoid JSON parsing, and skip processing property values with `replaceRegexpAll(nullIf(nullIf(value, ''), 'null'), '^\"|\"$', '')`, since string values in the property group mapping do not contain leading or trailing quotes.
Property access for parts that have not had the property group materialized may also be faster, especially for queries that reference multiple properties in the same group, since the properties JSON will only be parsed once during query execution per property group, but I didn't test this directly. As above, the `replaceRegexpAll` call is also skipped in this case, which can have a notable impact.
The queries generated here cannot utilize the bloom filter indexes yet on the materialized columns. Support for that will come in a follow-up change to special-case comparison operations that include a property group as one of the operands, so that the bloom filter indexes can be used and/or the number of map subcolumns that need to be read can be reduced.
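As a rough illustration of the per-value work that gets skipped, the following Python sketch approximates what the `replaceRegexpAll(nullIf(nullIf(value, ''), 'null'), ...)` wrapper does to a value extracted from the properties JSON. The function name is hypothetical, and this is an approximation in Python rather than the ClickHouse implementation.

```python
import re
from typing import Optional


def normalize_extracted_value(value: Optional[str]) -> Optional[str]:
    # nullIf(nullIf(value, ''), 'null'): an empty string and the literal
    # text 'null' are both treated as NULL.
    if value is None or value == "" or value == "null":
        return None
    # replaceRegexpAll(..., '^"|"$', ''): strip one leading and one
    # trailing double quote left over from the JSON encoding.
    return re.sub(r'^"|"$', "", value)


if __name__ == "__main__":
    for raw in ['"audio/vorbis"', "42", "", "null"]:
        print(repr(raw), "->", repr(normalize_extracted_value(raw)))
```

Values stored in the property group map are already unquoted, so this step is unnecessary when reading from the map.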
Does this work well for both Cloud and self-hosted?
Yes.
How did you test this code?