feat: Added the first iteration of trend insights breakdowns #17891

Gilbert09 · 2023-10-10T13:52:18Z

Problem

Continue translating the trends insights to use HogQL from PostHog/meta#130

Changes

Added support for breaking down a trends series by an event's properties (e.g. device ID or browser)
Pulled the query-building logic out of the main query runner and into its own class
Only string-based event breakdown properties are supported right now; other types will come next

How did you test this code?

By clicking around on localhost. Tests will be added once full breakdown support is finished.

mariusandra

I'll dig deeper in a bit... however, seeing that you split out a query_builder.py, one consideration to keep in mind is that we should find a way to have a to_persons_query() method somewhere in there. This method returns a special version of the same query that returns just one column: distinct person_id.

This will be used to power the new "persons modal", which opens when you click on any datapoint to drill deeper.

We have a very simple version of this working for lifecycles:

posthog/posthog/hogql_queries/insights/lifecycle_query_runner.py

Lines 75 to 86 in 7dfab45

    
    def to_persons_query(self) -> ast.SelectQuery | ast.SelectUnionQuery:
        # TODO: add support for selecting and filtering by breakdowns
        with self.timings.measure("persons_query"):
            return parse_select(
                """
                SELECT
                    person_id, start_of_period as breakdown_1, status as breakdown_2
                FROM
                    {events_query}
                """,
                placeholders={"events_query": self.events_query},
            )

... though still without any breakdown support (e.g. you don't see all returning users from Tuesday, but all users from all time).

Without going deep, I'm not sure yet whether that's an easy extra to add or something that might require a rethink of the builder.

mariusandra

Quick comment on one thing before I run to a meeting. Will review more after.

mariusandra · 2023-10-11T09:28:57Z

posthog/hogql_queries/insights/trends/breakdown_values.py

+            """,
+            placeholders={
+                "events_where": self._where_filter(),
+                "team_id": ast.Constant(value=self.team.pk),


We don't need the team_id field in HogQL, it gets added automatically. Seems like this placeholder is not in use here anyway.

Gilbert09 · Oct 11, 2023

Ahhh yes, unsure why this is here - but v clever that it gets added automatically. Does this apply to every query on events?

mariusandra · 2023-10-11T09:29:35Z

posthog/hogql_queries/insights/trends/breakdown_values.py

+    def _where_filter(self) -> ast.Expr:
+        filters: List[ast.Expr] = []
+
+        filters.append(parse_expr("team_id = {team_id}", placeholders={"team_id": ast.Constant(value=self.team.pk)}))


Same here, this can be omitted.

Gilbert09 · 2023-10-12T09:52:29Z

@mariusandra I've added support for histogram breakdowns now with the latest commit: 4b57daf

mariusandra

All right, finally got time for a proper look. Left comments below 👍

mariusandra · 2023-10-12T11:22:08Z

posthog/hogql_queries/insights/trends/breakdown.py

+
+    def _get_breakdown_buckets_ast(self) -> ast.Array:
+        buckets = self._get_breakdown_histogram_buckets()
+        values = list(map(lambda t: f"[{t[0]},{t[1]}]", buckets))


Suggested change

- values = list(map(lambda t: f"[{t[0]},{t[1]}]", buckets))
+ values = [f"[{t[0]},{t[1]}]" for t in buckets]

mariusandra · 2023-10-12T11:23:58Z

posthog/hogql_queries/insights/trends/breakdown.py

+        return self.enabled and self.query.breakdown.breakdown_histogram_bin_count is not None
+
+    def placeholders(self):
+        values = self._get_breakdown_buckets_ast() if self.is_histogram_breakdown else self._get_breakdown_values_ast


These two look inconsistent:

self._get_breakdown_buckets_ast() and self._get_breakdown_values_ast

I'd remove the _get from the property

mariusandra · 2023-10-12T11:25:03Z

posthog/hogql_queries/insights/trends/breakdown.py

+
+    @cached_property
+    def _get_breakdown_values_ast(self) -> ast.Array:
+        return ast.Array(exprs=list(map(lambda v: ast.Constant(value=v), self._get_breakdown_values)))


In Python, I've usually seen map written as an inline list comprehension:

Suggested change

- return ast.Array(exprs=list(map(lambda v: ast.Constant(value=v), self._get_breakdown_values)))
+ return ast.Array(exprs=[ast.Constant(value=v) for v in self._breakdown_values])
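A minimal sketch of why the two forms are interchangeable. The `Constant` dataclass here is a hypothetical stand-in for `ast.Constant`, used only so the example runs on its own; it is not the real HogQL class. Note that the suggestion above also renames the property from `_get_breakdown_values` to `_breakdown_values`.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Constant:
    """Hypothetical stand-in for ast.Constant (assumption, not the HogQL class)."""

    value: Any


def constants_with_map(values: List[Any]) -> List[Constant]:
    # Original style: map + lambda, wrapped in list()
    return list(map(lambda v: Constant(value=v), values))


def constants_with_comprehension(values: List[Any]) -> List[Constant]:
    # Suggested style: a list comprehension that builds the same list
    return [Constant(value=v) for v in values]


breakdown_values = ["Chrome", "Safari", "Firefox"]
same = constants_with_map(breakdown_values) == constants_with_comprehension(breakdown_values)
```

Both functions build the same list; the comprehension simply avoids the lambda and the extra `list()` call.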

mariusandra · 2023-10-12T11:27:57Z

posthog/hogql_queries/insights/trends/breakdown_values.py

+                parse_expr(
+                    "toTimeZone(timestamp, 'UTC') >= {date_from}",
+                    placeholders=self.query_date_range.to_placeholders(),
+                ),
+                parse_expr(
+                    "toTimeZone(timestamp, 'UTC') <= {date_to}",
+                    placeholders=self.query_date_range.to_placeholders(),
+                ),


You can omit the timezone call and just write timestamp. A wrapper with the correct timezone will be added when the field is printed later.

mariusandra · 2023-10-12T11:32:31Z

posthog/hogql_queries/insights/trends/breakdown.py

+    def _get_breakdown_values(self) -> ast.Array:
+        breakdown = BreakdownValues(
+            team=self.team,
+            event_name=series_event_name(self.series),
+            breakdown_field=self.query.breakdown.breakdown,
+            query_date_range=self.query_date_range,
+            histogram_bin_count=self.query.breakdown.breakdown_histogram_bin_count,
+        )
+        return breakdown.get_breakdown_values()


Would be great to time and label this separately

Gilbert09 · Oct 12, 2023

Could you expand a bit on what you mean here?

mariusandra · Oct 13, 2023

Ah, I meant that this query is making an internal query, so something like with self.timings.measure('do the breakdown dance') around it would make sense
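A minimal sketch of what timing and labelling the internal query separately could look like. The `Timings` class below is a hypothetical stand-in written for this example, not PostHog's actual timings helper; `fetch_breakdown_values` and the label name are likewise made up.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class Timings:
    """Hypothetical stand-in for the query runner's timings helper."""

    def __init__(self) -> None:
        self.durations: Dict[str, float] = {}

    @contextmanager
    def measure(self, label: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, so repeated measurements under one label add up
            elapsed = time.perf_counter() - start
            self.durations[label] = self.durations.get(label, 0.0) + elapsed


def fetch_breakdown_values() -> List[str]:
    # Placeholder for the internal breakdown-values query (assumption)
    return ["Chrome", "Safari"]


timings = Timings()
with timings.measure("breakdown_values_query"):
    values = fetch_breakdown_values()
```

With a dedicated label, the cost of the inner query shows up on its own line in the timings rather than being folded into the outer query's total.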

mariusandra · 2023-10-12T11:44:08Z

posthog/hogql_queries/insights/trends/breakdown_values.py

+            for i in range(self.histogram_bin_count + 1):
+                quantiles.append(i * bin_size)
+
+            qunatile_expression = f"quantiles({','.join([f'{quantile:.2f}' for quantile in quantiles])})(value)"


Is this .2f precision enough for a large bin count?

Gilbert09 · Oct 12, 2023

Taken from the current implementation, seems like it's been good enough in the past 🤷
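To make the precision question concrete, here is a small sketch that formats the quantile levels the same way as the diff and checks for duplicates. It assumes `bin_size` is `1 / histogram_bin_count`, which the snippet above does not show. There are only 101 distinct two-decimal strings between 0.00 and 1.00, so any bin count above 100 must produce repeated levels.

```python
from typing import List


def quantile_levels(bin_count: int) -> List[str]:
    """Format bin_count + 1 evenly spaced quantile levels with two decimals."""
    bin_size = 1.0 / bin_count  # assumption: bins are evenly spaced over [0, 1]
    return [f"{i * bin_size:.2f}" for i in range(bin_count + 1)]


def has_collisions(bin_count: int) -> bool:
    """Return True if two different quantile levels format to the same string."""
    levels = quantile_levels(bin_count)
    return len(set(levels)) < len(levels)


small = has_collisions(10)   # 11 levels, spaced 0.10 apart
large = has_collisions(200)  # 201 levels, but only 101 possible strings
```

For typical bin counts the two-decimal format is lossless, which fits the observation that it has been good enough so far; only very large bin counts would collapse adjacent levels.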

mariusandra · 2023-10-12T11:46:15Z

posthog/hogql_queries/insights/trends/query_builder.py

+                parse_expr(
+                    "(toTimeZone(timestamp, 'UTC') >= {date_from})",
+                    placeholders=self.query_date_range.to_placeholders(),
+                ),
+                parse_expr(
+                    "(toTimeZone(timestamp, 'UTC') <= {date_to})",
+                    placeholders=self.query_date_range.to_placeholders(),
+                ),


Same comment re timezones

mariusandra · 2023-10-12T11:48:57Z

posthog/hogql_queries/insights/trends/query_builder.py

+        if self._breakdown.enabled and not self._breakdown.is_histogram_breakdown:
+            filters.append(self._breakdown.events_where_filter())


Hm... won't this just exclude everything that's not under the top N breakdown options? Is this how the current query worked? I'd assume we'd like everything else to be returned under an "other" blob? 🤔

Gilbert09 · Oct 12, 2023

The current implementation only takes the top 25 results, ordered by count() at the moment. No bucketing via an "other" blob
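The difference between the two behaviours can be sketched in plain Python. This is an illustration only, not the HogQL query: the function name, the `bucket_rest` flag, and the "other" label are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def top_n_breakdown(values: Iterable[str], limit: int = 25, bucket_rest: bool = False) -> Dict[str, int]:
    """Count breakdown values and keep the `limit` most frequent ones.

    With bucket_rest=False, everything outside the top N is dropped
    (the behaviour described above). With bucket_rest=True, the
    remainder is summed under an "other" key.
    """
    counts = Counter(values)
    top = dict(counts.most_common(limit))
    if bucket_rest:
        rest = sum(counts.values()) - sum(top.values())
        if rest:
            # Note: this would clash with a real breakdown value named "other"
            top["other"] = rest
    return top


events = ["Chrome"] * 3 + ["Safari"] * 2 + ["Firefox"]
dropped = top_n_breakdown(events, limit=2)
bucketed = top_n_breakdown(events, limit=2, bucket_rest=True)
```

In the first call the Firefox event disappears from the result entirely; in the second it is counted under "other", so the series still sums to the total event count.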

mariusandra · 2023-10-12T11:53:51Z

posthog/hogql_queries/insights/trends/trends_query_runner.py

-    def to_persons_query(self) -> str:
-        # TODO: add support for selecting and filtering by breakdowns
-        raise NotImplementedError()
+    def to_query(self) -> List[ast.SelectQuery]:


The parent query runner returns just one ast.SelectQuery in the to_query method. This should power the "to hogql" button next to insights in the interface. Returning multiple queries breaks that flow.

Is there some way to return one query from here, even if just a large union all... and the runner will keep running individual queries?

Also, one other consideration: when we move to CH Cloud, we'll have access to a lot of parallelisation. This runner will run each query serially. Not blocking for now, but maybe we'll want to move the merge and formula into clickhouse somehow.

Gilbert09 · Oct 12, 2023

Would be keen to change this in an upcoming PR - can add it to my to-do list. It's been like this from the beginning of the trends query runner.

I'm expecting some insights won't run as a UNION ALL from the "to hogql" button due to the size of the query itself. Imagine 25 of these queries in a single UNION ALL. If we want the to_query method to return a single query, then cool, but we'll likely need to break it down into each SELECT before passing them to the execute method

mariusandra · Oct 13, 2023

Yep, but ClickHouse Cloud should ideally then parallelise those 25 queries and get the results much faster than us doing the loop in Python ever will.

mariusandra · 2023-10-12T11:59:18Z

posthog/hogql_queries/insights/trends/trends_query_runner.py

+    @cached_property
+    def _event_properties(self):
+        event_property_values = PropertyDefinition.objects.filter(
+            team_id=self.team.pk,
+            type__in=[None, PropertyDefinition.Type.EVENT],
+        ).values_list("name", "property_type")
+
+        event_properties = {name: property_type for name, property_type in event_property_values if property_type}
+
+        return event_properties


Some teams, like ours, have millions of event properties due to old plugins. Maybe we don't want all of them in memory 😬

Gilbert09 · Oct 12, 2023

Yep, that's fair - I've changed this to a get on the individual field, as opposed to fetching everything and then filtering

Gilbert09 · 2023-10-12T13:23:14Z

@mariusandra PR fixes have been pushed - ready for a second review 🥳

mariusandra

LGTM

* Added the first iteration of trend insights breakdowns
* Removed team id from all events filters
* Added support for histogram breakdowns
* PR fixes
* Support group breakdowns
* Abstract property chain into a cached prop

Gilbert09 requested review from mariusandra and a team October 10, 2023 13:52

mariusandra reviewed Oct 10, 2023


mariusandra reviewed Oct 11, 2023


mariusandra reviewed Oct 12, 2023


Gilbert09 added 5 commits October 12, 2023 16:46

Added the first iteration of trend insights breakdowns

370727c

Removed team id from all events filters

2b0a316

Added support for histogram breakdowns

f413262

PR fixes

adcd483

Support group breakdowns

16febec

Gilbert09 force-pushed the feat/trends-breakdown branch from 688e6e1 to 16febec Compare October 12, 2023 16:29

Abstract property chain into a cached prop

c0211d9

mariusandra approved these changes Oct 13, 2023


Gilbert09 merged commit 1ecf289 into master Oct 13, 2023
66 checks passed

Gilbert09 deleted the feat/trends-breakdown branch October 13, 2023 14:20


