Query Performance: Problems and Potential Optimizations #26651

tkaemming · 2024-12-04T17:25:06Z

In Progress

Reduce Column-Level Property Materializations

Unused/rarely used materialized columns add unnecessary load on ingestion and slow down mutations.

https://metabase.prod-us.posthog.dev/question/885-materialized-column-usage?lookback_window=1%20week&order_by=Times%20Used

Ideally there should be very few of these that are used by only a small number of teams or used infrequently. Property groups should be used to fill the gaps instead. Before disabling and dropping candidate columns, it would be helpful to have tooling that evaluates the difference in performance when property groups are used instead, run against a sample of previously seen queries that use those columns.
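As a rough illustration of how dematerialization candidates could be shortlisted from usage data like the Metabase question above, here is a minimal sketch. The `ColumnUsage` shape, the column names, and the thresholds are all assumptions made for the example, not values from our tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnUsage:
    column: str      # hypothetical materialized column name
    times_used: int  # queries that read the column during the lookback window
    team_count: int  # distinct teams that issued those queries


def dematerialization_candidates(usage, max_times_used=10, max_teams=2):
    """Return columns that are rarely used or used by few teams, least used first.

    The default thresholds are placeholders and would need tuning against real data.
    """
    candidates = [
        u for u in usage
        if u.times_used <= max_times_used or u.team_count <= max_teams
    ]
    return sorted(candidates, key=lambda u: (u.times_used, u.team_count))


usage = [
    ColumnUsage("mat_$current_url", 50_000, 900),
    ColumnUsage("mat_plan_tier", 3, 1),
    ColumnUsage("mat_legacy_flag", 0, 0),
]
print([u.column for u in dematerialization_candidates(usage)])
```

Anything this returns would still need the property group performance comparison described above before being disabled or dropped.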

There are also a number of "orphaned" materialized columns that exist on the sharded tables but not on the distributed tables (on both clusters) that should be cleaned up: https://posthog.slack.com/archives/C076R4753Q8/p1734643551572469?thread_ts=1733523845.296439&cid=C076R4753Q8

Make Materialized Columns Nullable

posthog/posthog/hogql/printer.py

Lines 1365 to 1369 in 8eef695

    
    # TODO: rematerialize all columns to properly support empty strings and "null" string values.
    if self.context.modifiers.materializationMode == MaterializationMode.LEGACY_NULL_AS_STRING:
        materialized_property_sql = f"nullIf({materialized_property_source}, '')"
    else:  # MaterializationMode AUTO or LEGACY_NULL_AS_NULL
        materialized_property_sql = f"nullIf(nullIf({materialized_property_source}, ''), 'null')"

Removing the need to process these columns should slightly improve query performance, and also unlock some additional improvements described below.

Nullable columns are the default for new columns as of #26448, but columns that were created prior to that change will need to be rematerialized as nullable.
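To make the intended simplification concrete, here is a hedged sketch of how the printer branch might look once it can tell nullable columns apart. The `is_nullable` flag and the standalone function are assumptions for illustration, and the enum is a stand-in whose values are invented; this is not the actual code in `printer.py`.

```python
from enum import Enum


class MaterializationMode(Enum):
    # Stand-in for the enum referenced in printer.py; the values are invented.
    AUTO = "auto"
    LEGACY_NULL_AS_STRING = "legacy_null_as_string"
    LEGACY_NULL_AS_NULL = "legacy_null_as_null"


def materialized_property_sql(source: str, mode: MaterializationMode, is_nullable: bool) -> str:
    """Build the SQL expression used to read a materialized property column."""
    if is_nullable:
        # Assumption: a nullable column already stores NULL for missing values,
        # so the column can be read directly with no wrapper functions.
        return source
    if mode == MaterializationMode.LEGACY_NULL_AS_STRING:
        return f"nullIf({source}, '')"
    # AUTO or LEGACY_NULL_AS_NULL
    return f"nullIf(nullIf({source}, ''), 'null')"


print(materialized_property_sql("events.mat_example", MaterializationMode.AUTO, is_nullable=True))
print(materialized_property_sql("events.mat_example", MaterializationMode.AUTO, is_nullable=False))
```

Reading the bare column in the nullable case is what removes the per-row `nullIf` processing.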

Rematerializing Columns as Nullable

Rematerializing these columns will require some minor refactoring of the materialized column management code, since right now a lot of it assumes that a property will only be materialized into one destination column.

We'll need to have both the old non-nullable representation and new nullable representation side by side on the same table while the new representation is being backfilled. The old column should continue to be prioritized for use when querying until the new column is suitably backfilled.
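The selection rule described above could look roughly like the following. This is a sketch only: the `MaterializedColumn` shape and its `backfilled` flag are hypothetical, and how "suitably backfilled" is actually determined is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MaterializedColumn:
    name: str          # hypothetical column name
    is_nullable: bool  # True for the new representation
    backfilled: bool   # hypothetical flag set once the backfill has completed


def choose_column(columns: list[MaterializedColumn]) -> Optional[MaterializedColumn]:
    """Pick the column to query for a property that may be materialized twice."""
    # Prefer the new nullable column, but only once it has been backfilled.
    for column in columns:
        if column.is_nullable and column.backfilled:
            return column
    # Otherwise keep using the old non-nullable column.
    for column in columns:
        if not column.is_nullable:
            return column
    # Nothing usable yet: the caller falls back to a non-materialized read.
    return None


old = MaterializedColumn("mat_example", is_nullable=False, backfilled=True)
new = MaterializedColumn("mat_example_nullable", is_nullable=True, backfilled=False)
print(choose_column([old, new]).name)
```

Once the backfill finishes and the flag flips, the same call returns the nullable column and the old one can be dropped.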

Blocked

Additional Column-Level Property Materializations

Many of the slow queries that are not caused by wide time ranges or large joins are slow because they filter on $-prefixed properties that are not part of a property group (on the expectation that they would be materialized as columns anyway), so they fall back to parsing the properties field.

This is blocked until:

the number of rarely used materialized columns is reduced (to avoid adding additional load, see above),
~~materialized columns can be made nullable (to avoid needing to rematerialize these later, see above)~~

It may also make sense to create another property group for less frequently used $-prefixed properties to fill any gaps (see below). We may also want to materialize columns out of the property group, if available, rather than JSON parsing the properties field.

Many of these columns should also be able to be used to speed up the sessions materialized views.

Improve Indexes on Column-Level Materialized Properties

There is likely a lot of potential for using data skipping indexes on some frequently used properties, like $current_url.

Blocked until materialized columns that can benefit from index creation are recreated as nullable, since the nullIf calls wrapped around the non-nullable columns confuse the analyzer and will cause indexes not to be used.

This will also require some work similar to #24381 to improve the likelihood that any new indexes are used.

Identifying the best indexes to use on these columns beyond "seems like it might help" may be a challenge without some more sophisticated analysis (we need to be able to identify what operators are applied to various columns in filtering steps of ReadFromMergeTree.)

Create Additional Property Groups

We can get better index utilization and remove JSON parsing overhead from many queries by adding property groups to other properties columns:

event.person_properties
person.properties (not indexable due to aggregation)
group.properties (not indexable due to aggregation)

This is blocked until the overall number of materialized columns is decreased, to avoid increasing load on ingestion. This will likely consume a lot of space, and the migration will be time-consuming.

We'll also need a way to mark property groups as disabled while backfilling (probably using column metadata like with column-level materialized properties) as falling back to the default expression is a pretty costly operation.
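One possible shape for that metadata, sketched under heavy assumptions: the JSON-in-column-comment format and both helper names below are hypothetical and are not the format used for column-level materialized properties today.

```python
import json


def encode_group_metadata(group_name: str, disabled: bool) -> str:
    """Serialize property group state for storage in a column comment (hypothetical format)."""
    return json.dumps({"property_group": group_name, "disabled": disabled}, sort_keys=True)


def is_group_queryable(column_comment: str) -> bool:
    """Return True only if the comment marks the group as present and not disabled."""
    try:
        metadata = json.loads(column_comment)
    except json.JSONDecodeError:
        # Assumption: a column without parseable metadata is treated as unusable.
        return False
    return isinstance(metadata, dict) and not metadata.get("disabled", False)


comment = encode_group_metadata("custom", disabled=True)
print(is_group_queryable(comment))
```

The query layer would skip any group for which `is_group_queryable` is false and read the properties column directly, which avoids evaluating the costly default expression on parts that have not been backfilled.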

For Later

Re-evaluate Person Override Squashing

We haven't been running these jobs to avoid increasing cluster load due to mutations, so the override table is growing.

Convert Internal Users Filter to Use Cohorts

For teams that don't use person properties on events, performing a JOIN on the person table is much slower when compared to filtering out people within a cohort, especially since the cohort being excluded often contains a very small number of people.

tkaemming added the "performance" label (Has to do with performance. For PRs, runs the clickhouse query performance suite) Dec 4, 2024

This was referenced Dec 5, 2024

perf: Only drop columns and indexes associated with materialized columns if they exist #26664

Merged

refactor: Backfill materialized columns by column reference instead of property #26742

Merged

tkaemming changed the title ~~Known Query Performance Problems~~ Query Performance: Problems and Potential Optimizations Dec 10, 2024
