feat(hogql): Add basic support for PoE v3 (distinct ID overrides) #21059
Size Change: 0 B. Total Size: 824 kB
Note there is actually a bug here (at least what I'd consider one!). The behavior described below is consistent with how v2 is already implemented, but not consistent with non-PoE reads. This HogQL query:

```sql
select event,
       person.id,
       count()
from events
where {filters} -- replaced with global date and property filters
  and person.properties.email is not null
group by event,
         person.id
order by count() desc
limit 100
```

… generates this ClickHouse SQL with v3, which looks generally correct:

```sql
SELECT
    events.event AS event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id) AS id,
    count()
FROM
    events
    LEFT OUTER JOIN (SELECT
        argMax(person_distinct_id_overrides.person_id, person_distinct_id_overrides.version) AS person_id,
        person_distinct_id_overrides.distinct_id AS distinct_id
    FROM
        person_distinct_id_overrides
    WHERE
        equals(person_distinct_id_overrides.team_id, 1)
    GROUP BY
        person_distinct_id_overrides.distinct_id
    HAVING
        ifNull(equals(argMax(person_distinct_id_overrides.is_deleted, person_distinct_id_overrides.version), 0), 0)) AS events__override ON equals(events.distinct_id, events__override.distinct_id)
WHERE
    and(equals(events.team_id, 1), greaterOrEquals(toTimeZone(events.timestamp, %(hogql_val_0)s), toDateTime64('2024-03-19 20:49:35.386761', 6, 'UTC')), isNotNull(replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(events.person_properties, %(hogql_val_1)s), ''), 'null'), '^"|"$', '')))
GROUP BY
    events.event,
    if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id)
ORDER BY
    count() DESC
LIMIT 100 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1
```

The non-PoE query looks like this:

```sql
SELECT
    events.event AS event,
    events__pdi__person.id AS id,
    count()
FROM
    events
    INNER JOIN (SELECT
        argMax(person_distinct_id2.person_id, person_distinct_id2.version) AS events__pdi___person_id,
        argMax(person_distinct_id2.person_id, person_distinct_id2.version) AS person_id,
        person_distinct_id2.distinct_id AS distinct_id
    FROM
        person_distinct_id2
    WHERE
        equals(person_distinct_id2.team_id, 1)
    GROUP BY
        person_distinct_id2.distinct_id
    HAVING
        ifNull(equals(argMax(person_distinct_id2.is_deleted, person_distinct_id2.version), 0), 0)) AS events__pdi ON equals(events.distinct_id, events__pdi.distinct_id)
    INNER JOIN (SELECT
        person.id AS id,
        replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(person.properties, %(hogql_val_0)s), ''), 'null'), '^"|"$', '') AS properties___email
    FROM
        person
    WHERE
        and(equals(person.team_id, 1), ifNull(in(tuple(person.id, person.version), (SELECT
            person.id AS id,
            max(person.version) AS version
        FROM
            person
        WHERE
            equals(person.team_id, 1)
        GROUP BY
            person.id
        HAVING
            ifNull(equals(argMax(person.is_deleted, person.version), 0), 0))), 0))
    SETTINGS optimize_aggregation_in_order=1) AS events__pdi__person ON equals(events__pdi.events__pdi___person_id, events__pdi__person.id)
WHERE
    and(equals(events.team_id, 1), greaterOrEquals(toTimeZone(events.timestamp, %(hogql_val_1)s), toDateTime64('2024-03-19 21:32:47.452490', 6, 'UTC')), isNotNull(events__pdi__person.properties___email))
GROUP BY
    events.event,
    events__pdi__person.id
ORDER BY
    count() DESC
LIMIT 100 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1
```

Since v3 is override based, we need to use a […]. If we want to make the v3 read behavior consistent with non-PoE reads, we'd need to: […]
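To make the inconsistency concrete, here's a toy Python sketch of the v3 resolution semantics (hypothetical data, not PostHog code): the `LEFT OUTER JOIN` falls back to the `person_id` denormalized onto the event when no override row exists, and never consults person deletion state.

```python
# Hypothetical data illustrating how v3 resolves person IDs: each event carries
# a denormalized person_id; overrides map distinct_id -> merged person_id.
events = [
    {"event": "pageview", "distinct_id": "d1", "person_id": "p1"},
    {"event": "pageview", "distinct_id": "d2", "person_id": "p2"},
    {"event": "click",    "distinct_id": "d3", "person_id": "p3"},
]

# Override table: d2 was merged into p1 after its events were written.
overrides = {"d2": "p1"}

# Persons marked deleted. The v3 read path below never consults this set,
# which is the inconsistency with non-PoE reads described above.
deleted_persons = {"p3"}

def v3_person_id(event):
    # LEFT OUTER JOIN semantics: use the override row if present,
    # otherwise fall back to the person_id stored on the event.
    return overrides.get(event["distinct_id"], event["person_id"])

resolved = [v3_person_id(e) for e in events]
print(resolved)  # ['p1', 'p1', 'p3'] -- p3's events still appear despite deletion
```

The non-PoE path's `INNER JOIN` against the `person` table would instead drop the `p3` row entirely.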
Additionally, this non-PoE HogQL query:

```sql
select event,
       count()
from events
where {filters} -- replaced with global date and property filters
group by event
order by count() desc
limit 100
```

… results in this ClickHouse SQL, which doesn't account for deletions either:

```sql
SELECT
    events.event AS event,
    count()
FROM
    events
WHERE
    and(equals(events.team_id, 1), greaterOrEquals(toTimeZone(events.timestamp, %(hogql_val_0)s), toDateTime64('2024-03-19 23:50:22.313916', 6, 'UTC')))
GROUP BY
    events.event
ORDER BY
    count() DESC
LIMIT 100 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1
```

… so I'm not sure what the expectations should be here, if there are any (maybe the behavior is just undefined?)
```python
elif modifiers.personsOnEventsMode == PersonsOnEventsMode.v3_enabled:
    database.events.fields["event_person_id"] = StringDatabaseField(name="person_id")
    database.events.fields["override"] = LazyJoin(
        from_field=["distinct_id"],  # ???
```
This is defined on line 182 above as well, but it wasn't obvious to me at a glance where this gets used (I didn't do a ton of digging either, though, to be fair). I figured this would be used for the LHS of the join comparator, but it seems like that's defined on the `JoinExpr` itself (i.e. the return value of `join_with_person_distinct_id_overrides_table`)?
It's essentially for this:

`posthog/posthog/hogql/transforms/lazy_tables.py`, lines 281 to 282 in 4a60cad:

```python
# Make sure we also add fields we will use for the join's "ON" condition into the list of fields accessed.
# Without this "pdi.person.id" won't work if you did not ALSO select "pdi.person_id" explicitly for the join.
```
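As a toy illustration of why that matters (hypothetical helper names, not the real `lazy_tables.py` implementation): the joined subquery only selects the fields that were requested, so the join key has to be forced into that set even when the user never selected it directly.

```python
# Sketch: a lazy-join subquery builder must union the ON-condition key into
# the requested fields, or the generated ON clause would reference a column
# that the subquery never selected.
def build_join_subquery(requested_fields, join_key):
    fields = set(requested_fields) | {join_key}
    return f"SELECT {', '.join(sorted(fields))} FROM person_distinct_id_overrides"

sql = build_join_subquery({"person_id"}, "distinct_id")
print(sql)  # SELECT distinct_id, person_id FROM person_distinct_id_overrides
```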
This is largely copy/pasted from the non-`_overrides`-suffixed version, because the table itself is basically the same, just with fewer rows.
```python
join_expr = ast.JoinExpr(table=select_from_person_distinct_id_overrides_table(requested_fields))
join_expr.join_type = "LEFT OUTER JOIN"
```
Assuming we are able to stay reasonably on top of merges, something that could be interesting to check out here would be using an `ANY` merge in place of the `argMax` to see if that has any performance impact. I suspect it wouldn't be worth it, since it'd increase the cardinality of the RHS table and has the potential to create nondeterminism headaches if there are lots of rows with duplicate keys, but it's hard to say that for sure without trying.
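The trade-off can be sketched in plain Python (toy data, not ClickHouse): `argMax(person_id, version)` deterministically keeps the highest-version row per key, while `ANY`-style semantics keep whichever row happens to be seen first, which is nondeterministic when duplicate keys disagree.

```python
# Two override rows for the same distinct_id, written by successive merges.
rows = [
    ("d1", "p_old", 1),
    ("d1", "p_new", 2),  # a later merge wrote a newer override for d1
]

def arg_max(rows):
    # Deterministic: per key, keep the person_id with the highest version.
    best = {}
    for distinct_id, person_id, version in rows:
        if distinct_id not in best or version > best[distinct_id][1]:
            best[distinct_id] = (person_id, version)
    return {k: v[0] for k, v in best.items()}

def any_value(rows):
    # ANY-style: per key, keep whichever row is encountered first.
    first = {}
    for distinct_id, person_id, _version in rows:
        first.setdefault(distinct_id, person_id)
    return first

print(arg_max(rows))    # {'d1': 'p_new'}
print(any_value(rows))  # {'d1': 'p_old'} -- depends on row order, hence the risk
```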
@mariusandra - I don't think this is ready yet due to the issue noted in this comment #21059 (comment), but I'm interested in your opinion on whether this is the right direction overall, as well as whether you have any clever ideas on how to address the problem mentioned in that (long) comment. I'd assume that injecting that predicate into all […]
Looks good and seems to work. Left some comments inline, and there are some more mypy errors to check.
Regarding deleted persons: that's something I never gave a lot of thought to until now. I guess there are two options:
- Subquery all the things (slow?)
- Automagically add where filters
The second option is definitely doable. We do add the `team_id=42` filter automagically, but that's done at the last possible stage (the printer) in order to make it hard to avoid. If we want added filters to potentially trigger lazy joins, they should be done early on in the resolver, the first time this block visits any table. Ideally we'd abstract this into a `LazyCondition` that you could just attach to any database table, and it'd work.
The bigger question for me is, what's the performance penalty for joining and checking for deletes in an average events query? We might not want to pay it, and would prefer some nightly scripts instead? 🤔 🤷
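The `LazyCondition` idea above might look something like this minimal sketch (hypothetical class and function names, not existing HogQL code): a predicate attached to a table definition that gets merged into the query's WHERE clause when the resolver first visits that table.

```python
# Toy sketch of attaching a lazily-injected condition to a table definition.
class Table:
    def __init__(self, name, lazy_condition=None):
        self.name = name
        self.lazy_condition = lazy_condition

def resolve_where(table, user_where=None):
    # Combine the user's WHERE clause with the table's attached condition,
    # as a resolver-stage pass might (so it can still trigger lazy joins).
    conditions = [user_where] if user_where else []
    if table.lazy_condition:
        conditions.append(table.lazy_condition)
    return " AND ".join(conditions)

events = Table("events", lazy_condition="person.is_deleted = 0")
print(resolve_where(events, "event = 'pageview'"))
# event = 'pageview' AND person.is_deleted = 0
```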
```python
        start=None,
    ),
)
database.events.fields["poe"].fields["id"] = database.events.fields["person_id"]
```
Some mypy errors here
Since the errors basically came from variations on the same ideas that were copy/pasted to support v3, and those original versions also had the same issues (you can see them next to the added lines in 0e6dba4), I just went ahead and added these to the baseline file, as I'm not sure the alternatives I could think of would make much sense (adding a […]). Let me know what you think — if you strongly disagree with this, I can look into an alternative, but this seems like the least worst path forward with these errors to me.
This makes sense and is helpful context. I have performance concerns about this as well.
Good question, and I'm not really sure. For queries that already include […] If we wanted to ensure that events associated with a deleted person were never returned, that'd be more expensive, since it always requires the join, even on queries that don't actually use the […] So, I guess there are three options here: […]
I think my preference would be to start with option 1 (keep this as-is) for the sake of testing, and then plan on looking into the other options as a follow-up, comparing before/after performance on real data sets. Seem reasonable?
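The cost difference being weighed here can be sketched in a few lines (toy data): filtering deleted persons at read time means every event must be resolved to a person first, even for queries that never touch person fields, while the as-is behavior skips the join entirely.

```python
# Toy data: (event_id, resolved_person_id) pairs and a deleted-persons set.
events = [("e1", "p1"), ("e2", "p2"), ("e3", "p2")]
deleted = {"p2"}

# As-is: no person join needed; deleted persons' events are returned.
no_filter = [event_id for event_id, _person_id in events]

# Filtering deletions: every event must consult person state first.
filtered = [event_id for event_id, person_id in events if person_id not in deleted]

print(no_filter)  # ['e1', 'e2', 'e3']
print(filtered)   # ['e1']
```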
We do have a job to delete events associated with deleted persons that runs weekly, but it is currently broken (being fixed soon, see #20221), so that at least helps address the issue over time (but also maybe creates its own issues with surprising data changes when a delete job runs long after the deletion request occurred…) I'm not sure that we'd be able to run that job more frequently than weekly, but maybe @bretthoerner or @fuziontech has more context/opinions on that.
📸 UI snapshots have been updated: 1 snapshot change in total (0 added, 1 modified, 0 deleted). Triggered by this commit.
All of that makes sense to me. Regarding deletes, starting with the simplest option (do nothing special today) also makes sense to me. Going forward, we could make "exclude deleted persons" a HogQL modifier, which someone could toggle on/off based on their needs on the accurate-vs-fast spectrum, defaulting everyone to "fast".
Problem
See #20460 and PostHog/product-internal#557 for context.
Changes
This adds support for join-based distinct ID person overrides for HogQL queries. (This doesn't turn on the querying path for any teams, it just makes it available in `/debug`.)

v3 is a bit of a hybrid of the non-PoE and PoE v2 approaches (the table is basically the same as the `person_distinct_id2` table, just with fewer rows in it, like v2 overrides), so a lot of this code is largely lifted and adapted from those existing versions.

We'll also want to test this with a dictionary once it's ready (#20971) to see how much/if there is a practical performance benefit there as well, but that can happen independently as a separate mode (e.g. `v3_enabled_dict` or similar).

Does this work well for both Cloud and self-hosted?
Yep!
How did you test this code?
Added tests, and also manually ran some queries with `/debug` and verified results.