
feat(hogql): Add basic support for PoE v3 (distinct ID overrides) #21059

Merged (12 commits) · Mar 22, 2024

Conversation

@tkaemming (Contributor) commented Mar 20, 2024

Problem

See #20460 and PostHog/product-internal#557 for context.

Changes

This adds support for join-based distinct ID person overrides for HogQL queries. (This doesn't turn on the querying path for any teams; it just makes it available in /debug.)

v3 is a bit of a hybrid of the non-PoE and PoE v2 approaches (the table is basically the same as the person_distinct_id2 table, just with fewer rows in it, like the v2 overrides), so a lot of this code is lifted and adapted from those existing versions.
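
For orientation, here's an illustrative sketch of the overrides table's shape as inferred from the generated queries below. The columns match what the queries reference; the exact types, engine, and ordering key are assumptions, not the production DDL:

CREATE TABLE person_distinct_id_overrides
(
    team_id Int64,
    distinct_id String,
    person_id UUID,
    version Int64,
    -- rows are soft-deleted and compacted by version, as in person_distinct_id2
    is_deleted Int8
)
-- engine and ordering key are assumptions for illustration
ENGINE = ReplacingMergeTree(version)
ORDER BY (team_id, distinct_id)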

We'll also want to test this with a dictionary once it's ready (#20971) to see whether (and how much of) a practical performance benefit there is as well, but that can happen independently as a separate mode (e.g. v3_enabled_dict or similar).

Does this work well for both Cloud and self-hosted?

Yep!

How did you test this code?

Added tests, also manually ran some queries with /debug and verified results.


@tkaemming (Contributor, Author) commented Mar 20, 2024

Note: there is actually a bug here (at least what I'd consider one!). The behavior described below is consistent with how v2 is already implemented, but not with non-PoE reads.

This HogQL query:

   select event,
          person.id,
          count()
     from events
    where {filters} -- replaced with global date and property filters
      and person.properties.email is not null
 group by event,
          person.id
 order by count() desc
    limit 100

… generates this ClickHouse SQL with v3, which looks generally correct:

SELECT
    events.event AS event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id) AS id,
    count()
FROM
    events
    LEFT OUTER JOIN (SELECT
        argMax(person_distinct_id_overrides.person_id, person_distinct_id_overrides.version) AS person_id,
        person_distinct_id_overrides.distinct_id AS distinct_id
    FROM
        person_distinct_id_overrides
    WHERE
        equals(person_distinct_id_overrides.team_id, 1)
    GROUP BY
        person_distinct_id_overrides.distinct_id
    HAVING
        ifNull(equals(argMax(person_distinct_id_overrides.is_deleted, person_distinct_id_overrides.version), 0), 0)) AS events__override ON equals(events.distinct_id, events__override.distinct_id)
WHERE
    and(equals(events.team_id, 1), greaterOrEquals(toTimeZone(events.timestamp, %(hogql_val_0)s), toDateTime64('2024-03-19 20:49:35.386761', 6, 'UTC')), isNotNull(replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(events.person_properties, %(hogql_val_1)s), ''), 'null'), '^"|"$', '')))
GROUP BY
    events.event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id)
ORDER BY
    count() DESC
LIMIT 100 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1

The non-PoE query looks like this:

SELECT
    events.event AS event,
    events__pdi__person.id AS id,
    count()
FROM
    events
    INNER JOIN (SELECT
        argMax(person_distinct_id2.person_id, person_distinct_id2.version) AS events__pdi___person_id,
        argMax(person_distinct_id2.person_id, person_distinct_id2.version) AS person_id,
        person_distinct_id2.distinct_id AS distinct_id
    FROM
        person_distinct_id2
    WHERE
        equals(person_distinct_id2.team_id, 1)
    GROUP BY
        person_distinct_id2.distinct_id
    HAVING
        ifNull(equals(argMax(person_distinct_id2.is_deleted, person_distinct_id2.version), 0), 0)) AS events__pdi ON equals(events.distinct_id, events__pdi.distinct_id)
    INNER JOIN (SELECT
        person.id AS id,
        replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(person.properties, %(hogql_val_0)s), ''), 'null'), '^"|"$', '') AS properties___email
    FROM
        person
    WHERE
        and(equals(person.team_id, 1), ifNull(in(tuple(person.id, person.version), (SELECT
                    person.id AS id,
                    max(person.version) AS version
                FROM
                    person
                WHERE
                    equals(person.team_id, 1)
                GROUP BY
                    person.id
                HAVING
                    ifNull(equals(argMax(person.is_deleted, person.version), 0), 0))), 0))
    SETTINGS optimize_aggregation_in_order=1) AS events__pdi__person ON equals(events__pdi.events__pdi___person_id, events__pdi__person.id)
WHERE
    and(equals(events.team_id, 1), greaterOrEquals(toTimeZone(events.timestamp, %(hogql_val_1)s), toDateTime64('2024-03-19 21:32:47.452490', 6, 'UTC')), isNotNull(events__pdi__person.properties___email))
GROUP BY
    events.event,
    events__pdi__person.id
ORDER BY
    count() DESC
LIMIT 100 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1

Since v3 is override-based, we need to use a LEFT OUTER JOIN, since there won't be an RHS row for every LHS row. The non-PoE query differs in that it uses an INNER JOIN, which ensures that rows excluded from the RHS due to person deletions (via the distinct ID deletion cascade) are also excluded from the LHS/full result set. So all PoE versions (1, 2, and 3) will return deleted events until they are deleted from the events table by the periodic async deletion job (and they might not be deleted even then, depending on whether the person_id targeted for deletion is accurate).

If we want to make v3 read behavior consistent with non-PoE reads, we'd need to:

  1. perform the join on every events query (regardless of whether the query otherwise uses the person_id),
  2. remove the HAVING clause from the events__override SELECT (so that deleted rows are not filtered out), and
  3. add a WHERE condition to the outer events query to filter out rows associated with a deleted person/distinct_id (see the sketch after this list).
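
For illustration, a rough sketch of how the generated v3 query above might change with those three steps applied (the is_deleted handling and the outer predicate here are my guesses at what the resolver would need to emit, not what it produces today):

SELECT
    events.event AS event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id) AS id,
    count()
FROM
    events
    -- step 1: this join would now be added to every events query
    LEFT OUTER JOIN (SELECT
        argMax(person_distinct_id_overrides.person_id, person_distinct_id_overrides.version) AS person_id,
        -- step 2: carry the latest deletion state instead of filtering it out in HAVING
        argMax(person_distinct_id_overrides.is_deleted, person_distinct_id_overrides.version) AS is_deleted,
        person_distinct_id_overrides.distinct_id AS distinct_id
    FROM
        person_distinct_id_overrides
    WHERE
        equals(person_distinct_id_overrides.team_id, 1)
    GROUP BY
        person_distinct_id_overrides.distinct_id) AS events__override ON equals(events.distinct_id, events__override.distinct_id)
WHERE
    -- step 3: exclude events whose latest override marks the person/distinct_id as deleted
    and(equals(events.team_id, 1), equals(ifNull(events__override.is_deleted, 0), 0))
GROUP BY
    events.event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id)
ORDER BY
    count() DESC
LIMIT 100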

Additionally, this non-PoE HogQL query:

   select event,
          count()
     from events
    where {filters} -- replaced with global date and property filters
 group by event
 order by count() desc
    limit 100

… results in this ClickHouse SQL, which doesn't account for deletions either:

SELECT
    events.event AS event,
    count()
FROM
    events
WHERE
    and(equals(events.team_id, 1), greaterOrEquals(toTimeZone(events.timestamp, %(hogql_val_0)s), toDateTime64('2024-03-19 23:50:22.313916', 6, 'UTC')))
GROUP BY
    events.event
ORDER BY
    count() DESC
LIMIT 100 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1

… so I'm not sure what the expectations should be here, if there are any (maybe the behavior is just undefined?)

elif modifiers.personsOnEventsMode == PersonsOnEventsMode.v3_enabled:
database.events.fields["event_person_id"] = StringDatabaseField(name="person_id")
database.events.fields["override"] = LazyJoin(
from_field=["distinct_id"], # ???
@tkaemming (Contributor, Author):

This is defined on 182 above as well, but it wasn't obvious to me at a glance where this gets used (though I didn't do a ton of digging, to be fair). I figured this would be used for the LHS of the join comparator, but it seems like that's defined on the JoinExpr itself (i.e. the return value of join_with_person_distinct_id_overrides_table)?

@mariusandra (Collaborator):

It's essentially for this:

# Make sure we also add fields we will use for the join's "ON" condition into the list of fields accessed.
# Without this "pdi.person.id" won't work if you did not ALSO select "pdi.person_id" explicitly for the join.

@tkaemming (Contributor, Author):

This is largely copy/pasted from the non-_overrides suffixed version, because the table itself is basically the same, just with fewer rows.

Comment on lines +55 to +56
join_expr = ast.JoinExpr(table=select_from_person_distinct_id_overrides_table(requested_fields))
join_expr.join_type = "LEFT OUTER JOIN"
@tkaemming (Contributor, Author):

Assuming we're able to stay reasonably on top of merges, something that could be interesting to check out here would be using an ANY join in place of the argMax, to see if that has any performance impact. I suspect it wouldn't be worth it, since it'd increase the cardinality of the RHS table and has the potential to create nondeterminism headaches if there are lots of rows with duplicate keys, but it's hard to say for sure without trying.
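
For concreteness, a minimal sketch of that alternative, assuming the overrides table is compacted well enough that each distinct_id has at most one live row (with duplicate keys, ANY strictness picks an arbitrary matching row, which is the nondeterminism concern above):

SELECT
    events.event AS event,
    -- resolve the person_id through the override row when one matched
    if(not(empty(overrides.distinct_id)), overrides.person_id, events.person_id) AS id,
    count()
FROM
    events
    -- ANY strictness joins at most one RHS row per LHS row, avoiding the argMax/GROUP BY subquery
    LEFT ANY JOIN person_distinct_id_overrides AS overrides ON equals(events.distinct_id, overrides.distinct_id)
WHERE
    equals(events.team_id, 1)
GROUP BY
    events.event,
    if(not(empty(overrides.distinct_id)), overrides.person_id, events.person_id)
ORDER BY
    count() DESC
LIMIT 100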

@tkaemming tkaemming requested a review from mariusandra March 20, 2024 21:48
@tkaemming (Contributor, Author) commented:

@mariusandra - I don't think this is ready yet due to the issue noted in this comment #21059 (comment), but I'm interested in your opinion on whether this is the right direction overall, and whether you have any clever ideas on how to address the problem mentioned in that (long) comment. I'd assume that injecting that predicate into all events queries would be doable, but maybe there's a better approach?

@mariusandra (Collaborator) left a comment

Looks good and seems to work. Left some comments inline, and there are some more mypy errors to check.

Regarding deleted persons. That's something I never gave a lot of thought to until now. I guess there are two options:

  1. Subquery all the things (slow?)
  2. Automagically add where filters

The second option is definitely doable. We do add the team_id=42 filter automagically, but that's done at the last possible stage (the printer) in order to make it hard to avoid. If we want added filters to potentially trigger lazy joins, they should be done early on in the resolver, the first time this block visits any table. Ideally we'd abstract this into a LazyCondition that you could just attach to any database table, and it'd work.

The bigger question for me is, what's the performance penalty for joining and checking for deletes in an average events query? We might not want to pay it, and would prefer some nightly scripts instead? 🤔 🤷


start=None,
),
)
database.events.fields["poe"].fields["id"] = database.events.fields["person_id"]
@mariusandra (Collaborator):

Some mypy errors here

@tkaemming (Contributor, Author) replied:

> […] there are some more mypy errors to check.

Since the errors basically came from variations on the same ideas that were copy/pasted to support v3, and those original versions also had the same issues (you can see them next to the added lines in 0e6dba4), I just went ahead and added these to the baseline file. I'm not sure the alternatives I could think of would make much sense: adding a cast or # mypy: ignore seems worse than adding them to the baseline file, assert isinstance(…) to convince the checker it's ok won't work with the parameterized generics, and resolving the fundamental typing issues here doesn't seem reasonable in the context of this change.

Let me know what you think — if you strongly disagree with this, I can look into an alternative, but this seems like the least worst path forward with these errors to me.

> Regarding deleted persons. That's something I never gave a lot of thought to until now. I guess there are two options:
>
>   1. Subquery all the things (slow?)
>   2. Automagically add where filters
>
> The second option is definitely doable. We do add the team_id=42 filter automagically, but that's done at the last possible stage (the printer) in order to make it hard to avoid. If we want added filters to potentially trigger lazy joins, they should be done early on in the resolver, the first time this block visits any table. Ideally we'd abstract this into a LazyCondition that you could just attach to any database table, and it'd work.

This makes sense and is helpful context. I have performance concerns about this as well.

> The bigger question for me is, what's the performance penalty for joining and checking for deletes in an average events query?

Good question, and I'm not really sure. For queries that already include person_id, I'd expect the impact not to be too severe — we're already loading the column and performing the join, so the filter expression should be of negligible cost.

If we wanted to ensure that events associated with a deleted person were never returned, that'd be more expensive, since it always requires the join, even on queries that don't actually use the person_id in the result set. That seems like the more correct thing to do… but we don't do that on non-PoE reads now, so maybe it's not a practical priority?

So, I guess there are three options here:

  1. Keep this as-is (don't handle deletions at all): this would be a regression from non-PoE results, but consistent with the other PoE versions. (This also means the cardinality of a result set could change depending on whether you're using person properties from the event or from the person, due to the different join strategies, which seems particularly bad.)
  2. Add the where filters only when person_id is somewhere in the query: this should yield results consistent with existing non-PoE queries.
  3. Add the where filters always, regardless of whether person_id is used: this yields the most correct result set, but is inconsistent with existing non-PoE queries, which don't consider this case. I'm not sure what the performance implications of this would be in practice.

I think my preference would be to start with option 1 (keep this as-is) for the sake of testing, and then plan on looking into the other options as a follow-up, comparing before/after performance on real data sets. Seem reasonable?

> We might not want to pay it, and would prefer some nightly scripts instead? 🤔 🤷

We do have the job that deletes events associated with deleted persons, which runs weekly, but it is currently broken (though being fixed soon, see #20221). That at least helps address the issue over time (though it maybe creates its own issues, with surprising data changes when a delete job runs long after the deletion request occurred…). I'm not sure that we'd be able to run that job more frequently than weekly, but maybe @bretthoerner or @fuziontech has more context/opinions on that.



@tkaemming tkaemming marked this pull request as ready for review March 22, 2024 01:13
@mariusandra (Collaborator):

All of that makes sense to me. Regarding deletes, starting with the simplest option (do nothing special today) also makes sense.

Going forward, we could make "exclude deleted persons" a HogQL modifier that someone could toggle on/off based on their needs on the accurate-vs-fast spectrum, defaulting everyone to "fast".
