-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added the first iteration of trend insghts breakdowns #17891
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll dig deeper in a bit... however seeing that you split out a query_builder.py
, one consideration to keep in mind is that we should find a way to have a to_persons_query()
method somewhere in there. This method return a special version of the same query, that just returns one column: distinct person_id
.
This will be used to power the new "persons modal", which opens when you click on any datapoint to drill deeper.
We have a very simple version of this working for lifecycles:
posthog/posthog/hogql_queries/insights/lifecycle_query_runner.py
Lines 75 to 86 in 7dfab45
def to_persons_query(self) -> ast.SelectQuery | ast.SelectUnionQuery: | |
# TODO: add support for selecting and filtering by breakdowns | |
with self.timings.measure("persons_query"): | |
return parse_select( | |
""" | |
SELECT | |
person_id, start_of_period as breakdown_1, status as breakdown_2 | |
FROM | |
{events_query} | |
""", | |
placeholders={"events_query": self.events_query}, | |
) |
returning
users from Tuesday, but all users from all the time).
Without going deep, I'm now not sure if that's an easy to add extra, or something that might require a rethink of the builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick comment on one thing before I run to a meeting. Will review more after.
""", | ||
placeholders={ | ||
"events_where": self._where_filter(), | ||
"team_id": ast.Constant(value=self.team.pk), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need the team_id
field in HogQL, it gets added automatically. Seems like this placeholder is not in use here anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh yes, unsure why this is here - but v clever that it gets added automatically. Does this apply to every query on events
?
def _where_filter(self) -> ast.Expr: | ||
filters: List[ast.Expr] = [] | ||
|
||
filters.append(parse_expr("team_id = {team_id}", placeholders={"team_id": ast.Constant(value=self.team.pk)})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, this can be omitted.
@mariusandra I've added support for histogram breakdowns now with the latest commit: 4b57daf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All right, finally got time for a proper look. Left comments below 👍
|
||
def _get_breakdown_buckets_ast(self) -> ast.Array: | ||
buckets = self._get_breakdown_histogram_buckets() | ||
values = list(map(lambda t: f"[{t[0]},{t[1]}]", buckets)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values = list(map(lambda t: f"[{t[0]},{t[1]}]", buckets)) | |
values = [f"[{t[0]},{t[1]}]" for t in buckets] |
return self.enabled and self.query.breakdown.breakdown_histogram_bin_count is not None | ||
|
||
def placeholders(self): | ||
values = self._get_breakdown_buckets_ast() if self.is_histogram_breakdown else self._get_breakdown_values_ast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two look inconsistent:
self._get_breakdown_buckets_ast()
self._get_breakdown_values_ast
I'd remove the _get
from the property
|
||
@cached_property | ||
def _get_breakdown_values_ast(self) -> ast.Array: | ||
return ast.Array(exprs=list(map(lambda v: ast.Constant(value=v), self._get_breakdown_values))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've usually seen map
written as a inline list loop in Python:
return ast.Array(exprs=list(map(lambda v: ast.Constant(value=v), self._get_breakdown_values))) | |
return ast.Array(exprs=[ast.Constant(value=v) for v in self._breakdown_values]) |
parse_expr( | ||
"toTimeZone(timestamp, 'UTC') >= {date_from}", | ||
placeholders=self.query_date_range.to_placeholders(), | ||
), | ||
parse_expr( | ||
"toTimeZone(timestamp, 'UTC') <= {date_to}", | ||
placeholders=self.query_date_range.to_placeholders(), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can omit the timezone call and just write timestamp
. A wrapper with the correct timezone will be added when the field is printed later.
def _get_breakdown_values(self) -> ast.Array: | ||
breakdown = BreakdownValues( | ||
team=self.team, | ||
event_name=series_event_name(self.series), | ||
breakdown_field=self.query.breakdown.breakdown, | ||
query_date_range=self.query_date_range, | ||
histogram_bin_count=self.query.breakdown.breakdown_histogram_bin_count, | ||
) | ||
return breakdown.get_breakdown_values() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to time and label this separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand a bit on what you mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I meant that this query is making an internal query, so something like with self.timings.measure('do the breakdown dance')
around it would make sense
for i in range(self.histogram_bin_count + 1): | ||
quantiles.append(i * bin_size) | ||
|
||
qunatile_expression = f"quantiles({','.join([f'{quantile:.2f}' for quantile in quantiles])})(value)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this .2f
precision enough for a large bin count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken from the current implementation, seems like it's been good enough in the past 🤷
parse_expr( | ||
"(toTimeZone(timestamp, 'UTC') >= {date_from})", | ||
placeholders=self.query_date_range.to_placeholders(), | ||
), | ||
parse_expr( | ||
"(toTimeZone(timestamp, 'UTC') <= {date_to})", | ||
placeholders=self.query_date_range.to_placeholders(), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment re timezones
if self._breakdown.enabled and not self._breakdown.is_histogram_breakdown: | ||
filters.append(self._breakdown.events_where_filter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... won't this just exclude everything that's not under the top N breakdown options? Is this how the current query worked? I'd assume we'd like everything else to be returned under a "other" blob? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation only takes the top 25 results, ordered by count()
at the moment. No bucketing via an "other" blob
def to_persons_query(self) -> str: | ||
# TODO: add support for selecting and filtering by breakdowns | ||
raise NotImplementedError() | ||
def to_query(self) -> List[ast.SelectQuery]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parent query runner returns just one ast.SelectQuery
in the to_query
method. This should power the "to hogql" button next to insights in the interface. Returning multiple queries breaks that flow.
Is there some way to return one query from here, even if just a large union all
... and the runner will keep running individual queries?
Also, one other consideration: when we move to CH Cloud, we'll have access to a lot of parallelisation. This runner will run each query serially. Not blocking for now, but maybe we'll want to move the merge and formula into clickhouse somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be keen to change this in an upcoming PR - can add it to my list of things TODO. It's been like this from the beginning of the trends query runner.
I'm expecting some insights won't run as a UNION ALL
from the "to hogql" button due to the size of the query itself. Imagine 25 x of these queries in a single union all. If we want the to_query
method to return a single query, then cool, but we'll likely need to break it down into each SELECT
before passing them to the execute
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, but ClickHouse Cloud should ideally then parallelise those 25 queries and get the results much faster than us doing the loop in Python ever will.
@cached_property | ||
def _event_properties(self): | ||
event_property_values = PropertyDefinition.objects.filter( | ||
team_id=self.team.pk, | ||
type__in=[None, PropertyDefinition.Type.EVENT], | ||
).values_list("name", "property_type") | ||
|
||
event_properties = {name: property_type for name, property_type in event_property_values if property_type} | ||
|
||
return event_properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's fair - have changed this to a get
on the individual field as opposed to get all and then filter
@mariusandra PR fixes have been pushed - ready for a second review 🥳 |
688e6e1
to
16febec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* Added the first iteration of trend insghts breakdowns * Removed team id from all events filters * Added support for histogram breakdowns * PR fixes * Support group breakdowns * Abstract property chain into a cached prop
Problem
Continue translating the trends insights to use HogQL from PostHog/meta#130
Changes
string
based event breakdown properties right now, other types to come nextHow did you test this code?
By clicking around in localhost.. testing to be added once full breakdown support is finished