In Progress
Reduce Column-Level Property Materializations
Unused/rarely used materialized columns add unnecessary load on ingestion and slow down mutations.
https://metabase.prod-us.posthog.dev/question/885-materialized-column-usage?lookback_window=1%20week&order_by=Times%20Used
Ideally there should be very few of these, used by a small number of teams or used infrequently; property groups should be used to fill the gaps instead. Before disabling and dropping candidate columns, it'd be helpful to have some tooling that evaluates the performance difference of using property groups for a sample of previously seen queries that reference columns that are candidates for dematerialization.
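Such tooling might start from ClickHouse's own query log. A minimal sketch of building a usage-count query for candidate columns (the column names are illustrative, and real tooling would need to handle the database-qualified entries in system.query_log):

```python
def column_usage_query(candidate_columns: list[str], lookback_days: int = 7) -> str:
    """Build a ClickHouse query counting how often each candidate
    materialized column appeared in recently finished queries, based on
    the `columns` array in system.query_log.

    Note: entries in query_log.columns are database-qualified
    (`db.table.column`); handling that qualification is omitted here.
    """
    quoted = ", ".join(f"'{c}'" for c in candidate_columns)
    return (
        "SELECT col, count() AS times_used\n"
        "FROM system.query_log\n"
        "ARRAY JOIN columns AS col\n"
        f"WHERE event_time > now() - INTERVAL {lookback_days} DAY\n"
        "  AND type = 'QueryFinish'\n"
        f"  AND col IN ({quoted})\n"
        "GROUP BY col\n"
        "ORDER BY times_used"
    )
```

Columns that never (or rarely) show up over a representative lookback window would be the first candidates to benchmark against their property-group equivalents.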
There are also a number of "orphaned" materialized columns that exist on the sharded tables but not on the distributed tables (on both clusters) that should be cleaned up: https://posthog.slack.com/archives/C076R4753Q8/p1734643551572469?thread_ts=1733523845.296439&cid=C076R4753Q8
Make Materialized Columns Nullable
(See posthog/posthog/hogql/printer.py, lines 1365–1369 at 8eef695.)
Removing the need to process these columns should slightly improve query performance, and also unlock some additional improvements described below.
Nullable columns are the default for new columns as of #26448, but columns that were created prior to that change will need to be rematerialized as nullable.
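In DDL terms, the rematerialization might look roughly like the following sketch. The table and column names are illustrative, not the actual schema; the key detail is that JSONExtract with a Nullable target type yields NULL for absent properties, where the old non-nullable columns stored an empty string:

```python
def nullable_materialization_ddl(table: str, prop: str, column: str) -> str:
    """Sketch the ALTER that adds a Nullable replacement column for a
    property previously materialized as a non-nullable String.
    Real names and types would come from the materialized column
    management code."""
    return (
        f"ALTER TABLE {table} "
        f"ADD COLUMN IF NOT EXISTS {column} Nullable(String) "
        f"MATERIALIZED JSONExtract(properties, '{prop}', 'Nullable(String)')"
    )
```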
Rematerializing Columns as Nullable
Rematerializing these columns will require some minor refactoring of the materialized column management code, since right now a lot of it assumes that a property will only ever be materialized into one destination column.
We'll need to have both the old non-nullable representation and new nullable representation side by side on the same table while the new representation is being backfilled. The old column should continue to be prioritized for use when querying until the new column is suitably backfilled.
Blocked
Additional Column-Level Property Materializations
Many of the slow queries that are not caused by wide time ranges or large joins are slow because they fall back to $-prefixed properties that are not part of a property group; these should be materialized anyway.

This is blocked until:

the number of rarely used materialized columns is reduced (to avoid adding additional load, see above),
materialized columns can be made nullable (to avoid needing to rematerialize these later, see above)

It may also make sense to create another property group for less often used $-prefixed properties to fill any gaps (see below). We may also want to materialize columns out of the property group, if available, rather than JSON parsing the properties field.

Many of these columns should also be able to be used to speed up the sessions materialized views.
Improve Indexes on Column-Level Materialized Properties
There is likely a lot of potential for using data skipping indexes on some frequently used properties, like $current_url.

Blocked until materialized columns that can benefit from index creation are recreated as nullable, since these calls confuse the analyzer and will cause indexes not to be used.
This will also require some work similar to #24381 to improve the likelihood that any new indexes are used.
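For example, a bloom-filter skipping index on a materialized $current_url column might be created and backfilled in two steps, sketched below. The index type, false-positive rate, and granularity are guesses that would need benchmarking against real query shapes, and the names are illustrative:

```python
def skipping_index_statements(table: str, column: str, name: str) -> list[str]:
    """Sketch the two statements involved: adding a data skipping index,
    then materializing it for parts written before the index existed."""
    return [
        f"ALTER TABLE {table} "
        f"ADD INDEX IF NOT EXISTS {name} {column} "
        f"TYPE bloom_filter(0.01) GRANULARITY 1",
        f"ALTER TABLE {table} MATERIALIZE INDEX {name}",
    ]
```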
Identifying the best indexes to use on these columns beyond "seems like it might help" may be a challenge without some more sophisticated analysis (we need to be able to identify what operators are applied to various columns in the filtering steps of ReadFromMergeTree.)

Create Additional Property Groups
We can get better index utilization and remove JSON parsing overhead from many queries by adding property groups to other properties columns:
event.person_properties
person.properties (not indexable due to aggregation)
group.properties (not indexable due to aggregation)

This is blocked until the overall number of materialized columns is decreased to avoid increasing load on ingestion. This will likely consume a lot of space, and the migration will be time consuming.
We'll also need a way to mark property groups as disabled while backfilling (probably using column metadata like with column-level materialized properties) as falling back to the default expression is a pretty costly operation.
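Stashing the disabled flag in the column's COMMENT (as the column-level materialized properties already do for their own metadata) could be checked at query-print time. A sketch, where the JSON shape and marker name are assumptions rather than an existing format:

```python
import json

DISABLED_MARKER = "group_disabled"  # hypothetical marker, not an existing format

def is_property_group_enabled(column_comment: str) -> bool:
    """Decide whether a property group column may be used when printing
    a query, based on metadata stashed in the column's COMMENT. Columns
    with no parseable metadata are treated as usable."""
    try:
        metadata = json.loads(column_comment)
    except (TypeError, ValueError):
        return True
    return not metadata.get(DISABLED_MARKER, False)
```

While a backfill is running the printer would see the marker and fall back to parsing the properties field directly, rather than triggering the costly default-expression fallback.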
For Later
Re-evaluate Person Override Squashing
We haven't been running these jobs to avoid increasing cluster load due to mutations, so the override table is growing.
Convert Internal Users Filter to Use Cohorts
For teams that don't use person properties on events, performing a JOIN against the person table is much slower than filtering out people within a cohort, especially since the excluded cohort often contains a very small number of people.
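The shape of the rewrite, roughly: instead of joining person just to exclude internal users, filter against the (small) exclusion cohort's membership. Table and column names below are illustrative:

```python
def exclude_internal_users_clause(cohort_id: int) -> str:
    """Build a WHERE fragment that filters out members of a small
    exclusion cohort, avoiding a JOIN against the person table."""
    return (
        "person_id NOT IN (\n"
        "    SELECT person_id\n"
        "    FROM cohortpeople\n"
        f"    WHERE cohort_id = {cohort_id}\n"
        ")"
    )
```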