Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query Performance: Problems and Potential Optimizations #26651

Open
tkaemming opened this issue Dec 4, 2024 · 0 comments
Open

Query Performance: Problems and Potential Optimizations #26651

tkaemming opened this issue Dec 4, 2024 · 0 comments
Labels
performance Has to do with performance. For PRs, runs the clickhouse query performance suite

Comments

@tkaemming
Copy link
Contributor

tkaemming commented Dec 4, 2024

In Progress

Reduce Column-Level Property Materializations

Unused/rarely used materialized columns add unnecessary load on ingestion and slow down mutations.

https://metabase.prod-us.posthog.dev/question/885-materialized-column-usage?lookback_window=1%20week&order_by=Times%20Used

Ideally there should be very few of these that are used by a small number of teams or used infrequently. Property groups should be used to fill the gaps instead. It'd be helpful to have some tooling to evaluate the difference in performance when using property groups for a sample of previously seen queries that use columns that are candidates for dematerialization before disabling and dropping them.

There are also a number of "orphaned" materialized columns that exist on the sharded tables but not on the distributed tables (on both clusters) that should be cleaned up: https://posthog.slack.com/archives/C076R4753Q8/p1734643551572469?thread_ts=1733523845.296439&cid=C076R4753Q8

Make Materialized Columns Nullable

# TODO: rematerialize all columns to properly support empty strings and "null" string values.
if self.context.modifiers.materializationMode == MaterializationMode.LEGACY_NULL_AS_STRING:
materialized_property_sql = f"nullIf({materialized_property_source}, '')"
else: # MaterializationMode AUTO or LEGACY_NULL_AS_NULL
materialized_property_sql = f"nullIf(nullIf({materialized_property_source}, ''), 'null')"

Removing the need to process these columns should slightly improve query performance, and also unlock some additional improvements described below.

Nullable columns are the default for new columns as of #26448, but columns that were created prior to that change will need to be rematerialized as nullable.

Rematerializing Columns as Nullable

Rematerializing these columns will require some minor refactoring to materialized column management code, since right now a lot of it assumes that a property will only be materialized into once destination column.

We'll need to have both the old non-nullable representation and new nullable representation side by side on the same table while the new representation is being backfilled. The old column should continue to be prioritized for use when querying until the new column is suitably backfilled.

Blocked

Additional Column-Level Property Materializations

Many of the slow queries that are not caused by wide time ranges or large joins are because they fall back to $-prefixed properties that are not part of a property group as they should be materialized anyway.

This is blocked until:

  1. the number of rarely used materialized columns is reduced (to avoid adding additional load, see above),
  2. materialized columns can be made nullable (to avoid needing to rematerialize these later, see above)

It may also make sense to create another property group for less often used $-prefixed properties to fill any gaps (see below.) We may also want to materialize columns out of the property group, if available, rather than JSON parsing the properties field.

Many of these columns should also be able to be used to speed up the sessions materialized views.

Improve Indexes on Column-Level Materialized Properties

There is likely a lot of potential for using data skipping indexes on some frequently used properties, like $current_url.

Blocked until materialized columns that can benefit from index creation are recreated as nullable since these calls confuse the analyzer and will cause indexes not to be used.

This will also require some work similar to #24381 to improve the likelihood that any new indexes are used.

Identifying the best indexes to use on these columns beyond "seems like it might help" may be a challenge without some more sophisticated analysis (we need to be able to identify what operators are applied to various columns in filtering steps of ReadFromMergeTree.)

Create Additional Property Groups

We can get better index utilization and remove JSON parsing overhead from many queries by adding property groups to other properties columns:

  1. event.person_properties
  2. person.properties (not indexable due to aggregation)
  3. group.properties (not indexable due to aggregation)

This is blocked until the overall number of materialized columns is decreased to avoid increasing load on ingestion. This will likely consume a lot of space and the migration will be time consuming.

We'll also need a way to mark property groups as disabled while backfilling (probably using column metadata like with column-level materialized properties) as falling back to the default expression is a pretty costly operation.

For Later

Re-evaluate Person Override Squashing

We haven't been running these jobs to avoid increasing cluster load due to mutations, so the override table is growing.

Convert Internal Users Filter to Use Cohorts

For teams that don't use person properties on events, performing a JOIN on the person table is much slower when compared to filtering out people within a cohort, especially since the cohort being excluded often contains a very small number of people.

@tkaemming tkaemming added the performance Has to do with performance. For PRs, runs the clickhouse query performance suite label Dec 4, 2024
@tkaemming tkaemming changed the title Known Query Performance Problems Query Performance: Problems and Potential Optimizations Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Has to do with performance. For PRs, runs the clickhouse query performance suite
Projects
None yet
Development

No branches or pull requests

1 participant