Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Added the first iteration of trend insghts breakdowns #17891

Merged
merged 6 commits into from
Oct 13, 2023

Conversation

Gilbert09
Copy link
Member

Problem

Continue translating the trends insights to use HogQL from PostHog/meta#130

Changes

  • Adding support for being able to breakdown a trends series by an events properties (e.g. device id, or browser)
  • Pulled out the query-building logic from the main query runner and into it's own class
  • Only supports string based event breakdown properties right now, other types to come next

Screenshot 2023-10-10 at 14 42 30

How did you test this code?

By clicking around in localhost.. testing to be added once full breakdown support is finished

@Gilbert09 Gilbert09 requested review from mariusandra and a team October 10, 2023 13:52
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll dig deeper in a bit... however seeing that you split out a query_builder.py, one consideration to keep in mind is that we should find a way to have a to_persons_query() method somewhere in there. This method return a special version of the same query, that just returns one column: distinct person_id.

This will be used to power the new "persons modal", which opens when you click on any datapoint to drill deeper.

We have a very simple version of this working for lifecycles:

def to_persons_query(self) -> ast.SelectQuery | ast.SelectUnionQuery:
# TODO: add support for selecting and filtering by breakdowns
with self.timings.measure("persons_query"):
return parse_select(
"""
SELECT
person_id, start_of_period as breakdown_1, status as breakdown_2
FROM
{events_query}
""",
placeholders={"events_query": self.events_query},
)
... though still without any breakdown support (e.g. you don't see all returning users from Tuesday, but all users from all the time).

Without going deep, I'm now not sure if that's an easy to add extra, or something that might require a rethink of the builder.

Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick comment on one thing before I run to a meeting. Will review more after.

""",
placeholders={
"events_where": self._where_filter(),
"team_id": ast.Constant(value=self.team.pk),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need the team_id field in HogQL, it gets added automatically. Seems like this placeholder is not in use here anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh yes, unsure why this is here - but v clever that it gets added automatically. Does this apply to every query on events?

def _where_filter(self) -> ast.Expr:
filters: List[ast.Expr] = []

filters.append(parse_expr("team_id = {team_id}", placeholders={"team_id": ast.Constant(value=self.team.pk)}))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, this can be omitted.

@Gilbert09
Copy link
Member Author

Gilbert09 commented Oct 12, 2023

@mariusandra I've added support for histogram breakdowns now with the latest commit: 4b57daf

Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, finally got time for a proper look. Left comments below 👍


def _get_breakdown_buckets_ast(self) -> ast.Array:
buckets = self._get_breakdown_histogram_buckets()
values = list(map(lambda t: f"[{t[0]},{t[1]}]", buckets))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
values = list(map(lambda t: f"[{t[0]},{t[1]}]", buckets))
values = [f"[{t[0]},{t[1]}]" for t in buckets]

return self.enabled and self.query.breakdown.breakdown_histogram_bin_count is not None

def placeholders(self):
values = self._get_breakdown_buckets_ast() if self.is_histogram_breakdown else self._get_breakdown_values_ast
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two look inconsistent:

self._get_breakdown_buckets_ast()
self._get_breakdown_values_ast

I'd remove the _get from the property


@cached_property
def _get_breakdown_values_ast(self) -> ast.Array:
return ast.Array(exprs=list(map(lambda v: ast.Constant(value=v), self._get_breakdown_values)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've usually seen map written as a inline list loop in Python:

Suggested change
return ast.Array(exprs=list(map(lambda v: ast.Constant(value=v), self._get_breakdown_values)))
return ast.Array(exprs=[ast.Constant(value=v) for v in self._breakdown_values])

Comment on lines 75 to 98
parse_expr(
"toTimeZone(timestamp, 'UTC') >= {date_from}",
placeholders=self.query_date_range.to_placeholders(),
),
parse_expr(
"toTimeZone(timestamp, 'UTC') <= {date_to}",
placeholders=self.query_date_range.to_placeholders(),
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can omit the timezone call and just write timestamp. A wrapper with the correct timezone will be added when the field is printed later.

Comment on lines 63 to 89
def _get_breakdown_values(self) -> ast.Array:
breakdown = BreakdownValues(
team=self.team,
event_name=series_event_name(self.series),
breakdown_field=self.query.breakdown.breakdown,
query_date_range=self.query_date_range,
histogram_bin_count=self.query.breakdown.breakdown_histogram_bin_count,
)
return breakdown.get_breakdown_values()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to time and label this separately

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand a bit on what you mean here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I meant that this query is making an internal query, so something like with self.timings.measure('do the breakdown dance') around it would make sense

for i in range(self.histogram_bin_count + 1):
quantiles.append(i * bin_size)

qunatile_expression = f"quantiles({','.join([f'{quantile:.2f}' for quantile in quantiles])})(value)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this .2f precision enough for a large bin count?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken from the current implementation, seems like it's been good enough in the past 🤷

Comment on lines 170 to 177
parse_expr(
"(toTimeZone(timestamp, 'UTC') >= {date_from})",
placeholders=self.query_date_range.to_placeholders(),
),
parse_expr(
"(toTimeZone(timestamp, 'UTC') <= {date_to})",
placeholders=self.query_date_range.to_placeholders(),
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment re timezones

Comment on lines +207 to +208
if self._breakdown.enabled and not self._breakdown.is_histogram_breakdown:
filters.append(self._breakdown.events_where_filter())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... won't this just exclude everything that's not under the top N breakdown options? Is this how the current query worked? I'd assume we'd like everything else to be returned under a "other" blob? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation only takes the top 25 results, ordered by count() at the moment. No bucketing via an "other" blob

def to_persons_query(self) -> str:
# TODO: add support for selecting and filtering by breakdowns
raise NotImplementedError()
def to_query(self) -> List[ast.SelectQuery]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent query runner returns just one ast.SelectQuery in the to_query method. This should power the "to hogql" button next to insights in the interface. Returning multiple queries breaks that flow.

Is there some way to return one query from here, even if just a large union all... and the runner will keep running individual queries?

Also, one other consideration: when we move to CH Cloud, we'll have access to a lot of parallelisation. This runner will run each query serially. Not blocking for now, but maybe we'll want to move the merge and formula into clickhouse somehow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be keen to change this in an upcoming PR - can add it to my list of things TODO. It's been like this from the beginning of the trends query runner.

I'm expecting some insights won't run as a UNION ALL from the "to hogql" button due to the size of the query itself. Imagine 25 x of these queries in a single union all. If we want the to_query method to return a single query, then cool, but we'll likely need to break it down into each SELECT before passing them to the execute method

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but ClickHouse Cloud should ideally then parallelise those 25 queries and get the results much faster than us doing the loop in Python ever will.

Comment on lines 209 to 218
@cached_property
def _event_properties(self):
event_property_values = PropertyDefinition.objects.filter(
team_id=self.team.pk,
type__in=[None, PropertyDefinition.Type.EVENT],
).values_list("name", "property_type")

event_properties = {name: property_type for name, property_type in event_property_values if property_type}

return event_properties
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some teams, like ours, have millions of event properties due to old plugins. Maybe we don't want all of them in memory 😬

image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's fair - have changed this to a get on the individual field as opposed to get all and then filter

@Gilbert09
Copy link
Member Author

@mariusandra PR fixes have been pushed - ready for a second review 🥳

@Gilbert09 Gilbert09 force-pushed the feat/trends-breakdown branch from 688e6e1 to 16febec Compare October 12, 2023 16:29
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Gilbert09 Gilbert09 merged commit 1ecf289 into master Oct 13, 2023
66 checks passed
@Gilbert09 Gilbert09 deleted the feat/trends-breakdown branch October 13, 2023 14:20
daibhin pushed a commit that referenced this pull request Oct 23, 2023
* Added the first iteration of trend insghts breakdowns

* Removed team id from all events filters

* Added support for histogram breakdowns

* PR fixes

* Support group breakdowns

* Abstract property chain into a cached prop
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants