Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support nullable materialized columns using native types #26448

Merged
merged 17 commits into from
Dec 5, 2024

Conversation

tkaemming
Copy link
Contributor

@tkaemming tkaemming commented Nov 26, 2024

Problem

Properties that are materialized as columns with the current implementation cannot distinguish between null values and the string value "null" and require processing when reading to lossily convert these values to nulls:

# TODO: rematerialize all columns to properly support empty strings and "null" string values.
if self.context.modifiers.materializationMode == MaterializationMode.LEGACY_NULL_AS_STRING:
materialized_property_sql = f"nullIf({materialized_property_source}, '')"
else: # MaterializationMode AUTO or LEGACY_NULL_AS_NULL
materialized_property_sql = f"nullIf(nullIf({materialized_property_source}, ''), 'null')"

Needing to transform these values when reading makes using data skipping indexes more difficult as the analyzer will not use the indexes for these expressions.

Additionally, property values are materialized as columns using the output of JSONExtractRaw with leading and trailing quotes stripped which can result in unnecessarily escaped values being contained within the stored.

Also see #19461 for details.

Changes

Adds the ability to materialize properties into Nullable(String) types. New columns added via the management command are created as nullable by default.

This uses the JSONExtract(column, key, 'Nullable(String)') form to avoid escaping errors (like as already happens with property groups):

SELECT JSONExtract('{"foo": "\\"hello\\""}', 'foo', 'Nullable(String)')

   ┌─JSONExtract('{"foo": "\\"hello\\""}', 'foo', 'Nullable(String)')─┐
1. │ "hello"                                                          │
   └──────────────────────────────────────────────────────────────────┘

Compare to:

SELECT replaceRegexpAll(JSONExtractRaw('{"foo": "\\"hello\\""}', 'foo'), '^"|"$', '')

   ┌─replaceRegexpAll(JSONExtractRaw('{"foo": "\\"hello\\""}', 'foo'), '^"|"$', '')─┐
1. │ \"hello\"                                                                      │
   └────────────────────────────────────────────────────────────────────────────────┘

Nullable columns are only used within HogQL queries as the change could affect existing legacy queries in undefined ways due to the format change as well as the introduction as null values.

Does this work well for both Cloud and self-hosted?

Yes.

How did you test this code?

Updated existing tests to account for nullable columns where relevant.

@tkaemming tkaemming changed the title feat: Support nullable materialized columns using native typesColumn types feat: Support nullable materialized columns using native types Nov 26, 2024
materialize("events", "withmat_nullable", is_nullable=True)
self.assertEqual(
self._expr("properties.withmat_nullable.json.yet", context),
"replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(events.mat_withmat_nullable, %(hogql_val_0)s, %(hogql_val_1)s), ''), 'null'), '^\"|\"$', '')",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to change this so it uses JSONExtract(events.mat_withmat_nullable, %(hogql_val_0)s, %(hogql_val_1)s, 'Nullable(String)') too (also will apply to the above), but will do that separately as it's not really coupled to this specific change and will cause a bunch of snapshot updates.

Comment on lines +12 to +14
class MaterializedColumn(Protocol):
name: ColumnName
is_nullable: bool
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary because the interface is defined in ee/* but used throughout the non-ee/ codebase and I don't want to wrap everything in typing.TYPE_CHECKING guards.

A lot of this would be simpler if we moved it to non-ee/ but not sure what the history is here so just working around it for now.

@@ -40,15 +38,36 @@
}


class MaterializedColumn(NamedTuple):
@dataclass
class MaterializedColumn:
name: ColumnName
details: MaterializedColumnDetails
Copy link
Contributor Author

@tkaemming tkaemming Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably consolidate these two classes (MaterializedColumn & MaterializedColumnDetails) at some point (there's not much benefit to them being separate at this point) but not in a big hurry to do that.

@tkaemming tkaemming marked this pull request as ready for review December 4, 2024 22:46
@tkaemming tkaemming requested a review from a team as a code owner December 4, 2024 22:46
ADD COLUMN IF NOT EXISTS {self.column.name} VARCHAR
MATERIALIZED {TRIM_AND_EXTRACT_PROPERTY.format(table_column=self.column.details.table_column)}
""",
f"ADD COLUMN IF NOT EXISTS {self.column.name} {self.column.type} MATERIALIZED {expression}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:chef-kiss:

Comment on lines 464 to 467
lib_materialized_column = get_materialized_column_for_property("events", "properties", "$lib")
lib_expression = (
lib_materialized_column.name if lib_materialized_column is not None else "JSONExtractString(properties, '$lib')"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use that get_property_string_expr method here is the only feedback I have

Copy link
Contributor Author

@tkaemming tkaemming Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, the generated query is slightly different than the original, but not in a way that I think really matters all that much: 422d678

@tkaemming tkaemming merged commit cc990fd into master Dec 5, 2024
89 checks passed
@tkaemming tkaemming deleted the column-types branch December 5, 2024 22:14
Copy link

sentry-io bot commented Dec 6, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ NetworkError: Connection refused (ch-offline.posthog.net:9440) /api/projects/{parent_lookup_team_id}/events/ View Issue
  • ‼️ NetworkError: Connection refused (ch-online.posthog.net:9440) /api/environments/{parent_lookup_team_id}/query/ View Issue
  • ‼️ NetworkError: Connection refused (ch-online.posthog.net:9440) /api/person/values/ View Issue
  • ‼️ NetworkError: Connection refused (ch-online.posthog.net:9440) posthog.tasks.tasks.process_query_task View Issue
  • ‼️ NetworkError: Connection refused (ch-online.posthog.net:9440) posthog.tasks.tasks.process_query_task View Issue

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants