
feat(hogql): Add basic support for PoE v3 (distinct ID overrides) #21059

Merged (12 commits) · Mar 22, 2024

Conversation

@tkaemming (Contributor) commented Mar 20, 2024

Problem

See #20460 and PostHog/product-internal#557 for context.

Changes

This adds support for join-based distinct ID person overrides for HogQL queries. (This doesn't turn on the querying path for any teams; it just makes it available in /debug.)

v3 is a bit of a hybrid of the non-PoE and PoE v2 approaches (the table is basically the same as the person_distinct_id2 table, just with fewer rows in it, like the v2 overrides), so a lot of this code is lifted and adapted from those existing versions.
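
For orientation, here's an illustrative sketch of the overrides table's shape as inferred from the generated queries below. The columns match what the queries reference; the exact types, engine, and ordering key are assumptions, not the production DDL:

CREATE TABLE person_distinct_id_overrides
(
    team_id Int64,
    distinct_id String,
    person_id UUID,
    version Int64,
    -- rows are soft-deleted and compacted by version, as in person_distinct_id2
    is_deleted Int8
)
-- engine and ordering key are assumptions for illustration
ENGINE = ReplacingMergeTree(version)
ORDER BY (team_id, distinct_id)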

We'll also want to test this with a dictionary once it's ready (#20971) to see whether (and how much of) a practical performance benefit there is as well, but that can happen independently as a separate mode (e.g. v3_enabled_dict or similar).

Does this work well for both Cloud and self-hosted?

Yep!

How did you test this code?

Added tests, also manually ran some queries with /debug and verified results.


@tkaemming (Contributor, Author) commented Mar 20, 2024

Note: there is actually a bug here (at least what I'd consider one!). The behavior described below is consistent with how v2 is already implemented, but not with non-PoE reads.

This HogQL query:

   select event,
          person.id,
          count()
     from events
    where {filters} -- replaced with global date and property filters
      and person.properties.email is not null
 group by event,
          person.id
 order by count() desc
    limit 100

… generates this ClickHouse SQL with v3, which looks generally correct:

SELECT
    events.event AS event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id) AS id,
    count()
FROM
    events
    LEFT OUTER JOIN (SELECT
        argMax(person_distinct_id_overrides.person_id, person_distinct_id_overrides.version) AS person_id,
        person_distinct_id_overrides.distinct_id AS distinct_id
    FROM
        person_distinct_id_overrides
    WHERE
        equals(person_distinct_id_overrides.team_id, 1)
    GROUP BY
        person_distinct_id_overrides.distinct_id
    HAVING
        ifNull(equals(argMax(person_distinct_id_overrides.is_deleted, person_distinct_id_overrides.version), 0), 0)) AS events__override ON equals(events.distinct_id, events__override.distinct_id)
WHERE
    and(equals(events.team_id, 1), greaterOrEquals(toTimeZone(events.timestamp, %(hogql_val_0)s), toDateTime64('2024-03-19 20:49:35.386761', 6, 'UTC')), isNotNull(replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(events.person_properties, %(hogql_val_1)s), ''), 'null'), '^"|"$', '')))
GROUP BY
    events.event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id)
ORDER BY
    count() DESC
LIMIT 100 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1

The non-PoE query looks like this:

SELECT
    events.event AS event,
    events__pdi__person.id AS id,
    count()
FROM
    events
    INNER JOIN (SELECT
        argMax(person_distinct_id2.person_id, person_distinct_id2.version) AS events__pdi___person_id,
        argMax(person_distinct_id2.person_id, person_distinct_id2.version) AS person_id,
        person_distinct_id2.distinct_id AS distinct_id
    FROM
        person_distinct_id2
    WHERE
        equals(person_distinct_id2.team_id, 1)
    GROUP BY
        person_distinct_id2.distinct_id
    HAVING
        ifNull(equals(argMax(person_distinct_id2.is_deleted, person_distinct_id2.version), 0), 0)) AS events__pdi ON equals(events.distinct_id, events__pdi.distinct_id)
    INNER JOIN (SELECT
        person.id AS id,
        replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(person.properties, %(hogql_val_0)s), ''), 'null'), '^"|"$', '') AS properties___email
    FROM
        person
    WHERE
        and(equals(person.team_id, 1), ifNull(in(tuple(person.id, person.version), (SELECT
                    person.id AS id,
                    max(person.version) AS version
                FROM
                    person
                WHERE
                    equals(person.team_id, 1)
                GROUP BY
                    person.id
                HAVING
                    ifNull(equals(argMax(person.is_deleted, person.version), 0), 0))), 0))
    SETTINGS optimize_aggregation_in_order=1) AS events__pdi__person ON equals(events__pdi.events__pdi___person_id, events__pdi__person.id)
WHERE
    and(equals(events.team_id, 1), greaterOrEquals(toTimeZone(events.timestamp, %(hogql_val_1)s), toDateTime64('2024-03-19 21:32:47.452490', 6, 'UTC')), isNotNull(events__pdi__person.properties___email))
GROUP BY
    events.event,
    events__pdi__person.id
ORDER BY
    count() DESC
LIMIT 100 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1

Since v3 is override-based, we need to use a LEFT OUTER JOIN, since there won't be an RHS row for every LHS row. The non-PoE query differs in that it uses an INNER JOIN, which ensures that rows excluded from the RHS due to person deletions (via the distinct ID deletion cascade) are also excluded from the LHS/full result set. So all PoE versions (1, 2, and 3) will return deleted events until they are deleted from the events table by the periodic async deletion job (and they might not be deleted even then, depending on whether the person_id targeted for deletion is accurate).

If we want to make v3 read behavior consistent with non-PoE reads, we'd need to:

  1. perform the join on every events query (regardless of whether the query otherwise uses the person_id),
  2. remove the HAVING clause from the events__override SELECT (so that deleted rows are not filtered out), and
  3. add a WHERE condition to the outer events query to filter out rows associated with a deleted person/distinct_id (see the sketch after this list).
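
For illustration, a rough sketch of how the generated v3 query above might change with those three steps applied (the is_deleted handling and the outer predicate here are my guesses at what the resolver would need to emit, not what it produces today):

SELECT
    events.event AS event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id) AS id,
    count()
FROM
    events
    -- step 1: this join would now be added to every events query
    LEFT OUTER JOIN (SELECT
        argMax(person_distinct_id_overrides.person_id, person_distinct_id_overrides.version) AS person_id,
        -- step 2: carry the latest deletion state instead of filtering it out in HAVING
        argMax(person_distinct_id_overrides.is_deleted, person_distinct_id_overrides.version) AS is_deleted,
        person_distinct_id_overrides.distinct_id AS distinct_id
    FROM
        person_distinct_id_overrides
    WHERE
        equals(person_distinct_id_overrides.team_id, 1)
    GROUP BY
        person_distinct_id_overrides.distinct_id) AS events__override ON equals(events.distinct_id, events__override.distinct_id)
WHERE
    -- step 3: exclude events whose latest override marks the person/distinct_id as deleted
    and(equals(events.team_id, 1), equals(ifNull(events__override.is_deleted, 0), 0))
GROUP BY
    events.event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id)
ORDER BY
    count() DESC
LIMIT 100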

Additionally, this non-PoE HogQL query:

   select event,
          count()
     from events
    where {filters} -- replaced with global date and property filters
 group by event
 order by count() desc
    limit 100

… results in this ClickHouse SQL, which doesn't account for deletions either:

SELECT
    events.event AS event,
    count()
FROM
    events
WHERE
    and(equals(events.team_id, 1), greaterOrEquals(toTimeZone(events.timestamp, %(hogql_val_0)s), toDateTime64('2024-03-19 23:50:22.313916', 6, 'UTC')))
GROUP BY
    events.event
ORDER BY
    count() DESC
LIMIT 100 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1

… so I'm not sure what the expectations should be here, if there are any (maybe the behavior is just undefined?)

elif modifiers.personsOnEventsMode == PersonsOnEventsMode.v3_enabled:
database.events.fields["event_person_id"] = StringDatabaseField(name="person_id")
database.events.fields["override"] = LazyJoin(
from_field=["distinct_id"], # ???
@tkaemming (Contributor, Author):

This is defined on 182 above as well, but it wasn't obvious to me at a glance where this gets used (though I didn't do a ton of digging, to be fair). I figured this would be used for the LHS of the join comparator, but it seems like that's defined on the JoinExpr itself (i.e. the return value of join_with_person_distinct_id_overrides_table)?

@mariusandra (Collaborator):

It's essentially for this:

# Make sure we also add fields we will use for the join's "ON" condition into the list of fields accessed.
# Without this "pdi.person.id" won't work if you did not ALSO select "pdi.person_id" explicitly for the join.

@tkaemming (Contributor, Author):

This is largely copy/pasted from the non-_overrides suffixed version, because the table itself is basically the same, just with fewer rows.

Comment on lines +55 to +56
join_expr = ast.JoinExpr(table=select_from_person_distinct_id_overrides_table(requested_fields))
join_expr.join_type = "LEFT OUTER JOIN"
@tkaemming (Contributor, Author):

Assuming we're able to stay reasonably on top of merges, something that could be interesting to check out here would be using an ANY join in place of the argMax, to see if that has any performance impact. I suspect it wouldn't be worth it, since it'd increase the cardinality of the RHS table and has the potential to create nondeterminism headaches if there are lots of rows with duplicate keys, but it's hard to say for sure without trying.
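
For concreteness, a minimal sketch of that alternative, assuming the overrides table is compacted well enough that each distinct_id has at most one live row (with duplicate keys, ANY strictness picks an arbitrary matching row, which is the nondeterminism concern above):

SELECT
    events.event AS event,
    -- resolve the person_id through the override row when one matched
    if(not(empty(overrides.distinct_id)), overrides.person_id, events.person_id) AS id,
    count()
FROM
    events
    -- ANY strictness joins at most one RHS row per LHS row, avoiding the argMax/GROUP BY subquery
    LEFT ANY JOIN person_distinct_id_overrides AS overrides ON equals(events.distinct_id, overrides.distinct_id)
WHERE
    equals(events.team_id, 1)
GROUP BY
    events.event,
    if(not(empty(overrides.distinct_id)), overrides.person_id, events.person_id)
ORDER BY
    count() DESC
LIMIT 100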

@tkaemming tkaemming requested a review from mariusandra March 20, 2024 21:48
@tkaemming (Contributor, Author) commented:

@mariusandra - I don't think this is ready yet due to the issue noted in this comment #21059 (comment), but I'm interested in your opinion on whether this is the right direction overall, and whether you have any clever ideas on how to address the problem mentioned in that (long) comment. I'd assume that injecting that predicate into all events queries would be doable, but maybe there's a better approach?

@mariusandra (Collaborator) left a comment

Looks good and seems to work. Left some comments inline, and there are some more mypy errors to check.

Regarding deleted persons. That's something I never gave a lot of thought to until now. I guess there are two options:

  1. Subquery all the things (slow?)
  2. Automagically add where filters

The second option is definitely doable. We do add the team_id=42 filter automagically, but that's done at the last possible stage (the printer) in order to make it hard to avoid. If we want added filters to potentially trigger lazy joins, they should be done early on in the resolver, the first time this block visits any table. Ideally we'd abstract this into a LazyCondition that you could just attach to any database table, and it'd work.

The bigger question for me is, what's the performance penalty for joining and checking for deletes in an average events query? We might not want to pay it, and would prefer some nightly scripts instead? 🤔 🤷


start=None,
),
)
database.events.fields["poe"].fields["id"] = database.events.fields["person_id"]
@mariusandra (Collaborator):

Some mypy errors here

@tkaemming (Contributor, Author) replied:

> […] there are some more mypy errors to check.

Since the errors basically came from variations on the same ideas that were copy/pasted to support v3, and those original versions also had the same issues (you can see them next to the added lines in 0e6dba4), I just went ahead and added these to the baseline file. I'm not sure the alternatives I could think of would make much sense: adding a cast or # mypy: ignore seems worse than adding them to the baseline file, assert isinstance(…) to convince the checker it's ok won't work with the parameterized generics, and resolving the fundamental typing issues here doesn't seem reasonable in the context of this change.

Let me know what you think — if you strongly disagree with this, I can look into an alternative, but this seems like the least worst path forward with these errors to me.

> Regarding deleted persons. That's something I never gave a lot of thought to until now. I guess there are two options:
>
>   1. Subquery all the things (slow?)
>   2. Automagically add where filters
>
> The second option is definitely doable. We do add the team_id=42 filter automagically, but that's done at the last possible stage (the printer) in order to make it hard to avoid. If we want added filters to potentially trigger lazy joins, they should be done early on in the resolver, the first time this block visits any table. Ideally we'd abstract this into a LazyCondition that you could just attach to any database table, and it'd work.

This makes sense and is helpful context. I have performance concerns about this as well.

> The bigger question for me is, what's the performance penalty for joining and checking for deletes in an average events query?

Good question, and I'm not really sure. For queries that already include person_id, I'd expect the impact not to be too severe — we're already loading the column and performing the join, so the filter expression should be of negligible cost.

If we wanted to ensure that events associated with a deleted person were never returned, that'd be more expensive, since it always requires the join, even on queries that don't actually use the person_id in the result set. That seems like the more correct thing to do… but we don't do that on non-PoE reads now, so maybe it's not a practical priority?

So, I guess there are three options here:

  1. Keep this as-is (don't handle deletions at all): this would be a regression from non-PoE results, but consistent with the other PoE versions. (This also means the cardinality of a result set could change depending on whether you're using person properties from the event or from the person, due to the different join strategies, which seems particularly bad.)
  2. Add the where filters only when person_id is somewhere in the query: this should yield results consistent with existing non-PoE queries.
  3. Add the where filters always, regardless of whether person_id is used: this yields the most correct result set, but is inconsistent with existing non-PoE queries, which don't consider this case. I'm not sure what the performance implications of this would be in practice.

I think my preference would be to start with option 1 (keep this as-is) for the sake of testing, and then plan on looking into the other options as a follow-up, comparing before/after performance on real data sets. Seem reasonable?

> We might not want to pay it, and would prefer some nightly scripts instead? 🤔 🤷

We do have the job that deletes events associated with deleted persons, which runs weekly, but it is currently broken (though being fixed soon, see #20221). That at least helps address the issue over time (though it maybe creates its own issues, with surprising data changes when a delete job runs long after the deletion request occurred…). I'm not sure that we'd be able to run that job more frequently than weekly, but maybe @bretthoerner or @fuziontech has more context/opinions on that.



@tkaemming tkaemming marked this pull request as ready for review March 22, 2024 01:13
@mariusandra (Collaborator):

All of that makes sense to me. Regarding deletes, starting with the simplest option (do nothing special today) also makes sense.

Going forward, we could make "exclude deleted persons" a HogQL modifier that someone could toggle on/off based on their needs on the accurate-vs-fast spectrum, defaulting everyone to "fast".
