-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(web-analytics): Add backend version of web analytics queries #17629
Conversation
Hey @robbie-c! 👋 |
106d7b3
to
2ce110e
Compare
850b612
to
297be07
Compare
@@ -5,7 +5,7 @@ import { SSO_PROVIDER_NAMES } from 'lib/constants' | |||
import { preflightLogic } from 'scenes/PreflightCheck/preflightLogic' | |||
import { SSOProvider } from '~/types' | |||
|
|||
interface SSOSelectInterface { | |||
export interface SSOSelectInterface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by fix, fixing a TypeScript error in some test code (which therefore doesn't trigger CI failures, but does still show up in my editor)
3436a0b
to
789c19d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work! I left some loose thoughts below regarding code placement, etc.
const results = !response | ||
? null | ||
: 'results' in response && Array.isArray(response.results) | ||
? response.results | ||
: 'result' in response && Array.isArray(response.result) | ||
? response.result | ||
: null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to find a way to standardise here 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, hogql and hogql insights were already different here 🙉 I could just change one but I'm a bit scared of sweeping changes like that. Probably safest to change the hogql insights given it's not running in prod yet, and could do that in a different PR to keep this one less noisy.
41ee532
to
860ebc9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more thoughts
""" | ||
WITH | ||
|
||
session_cte AS ( | ||
SELECT | ||
events.properties.`$session_id` AS session_id, | ||
min(events.timestamp) AS min_timestamp, | ||
max(events.timestamp) AS max_timestamp, | ||
dateDiff('second', min_timestamp, max_timestamp) AS duration_s, | ||
|
||
-- create a tuple so that these are grouped in the same order, see https://github.com/ClickHouse/ClickHouse/discussions/42338 | ||
groupArray((events.timestamp, events.properties.`$referrer`, events.properties.`$pathname`, events.properties.utm_source)) AS tuple_array, | ||
arrayFirstIndex(x -> tupleElement(x, 1) == min_timestamp, tuple_array) as index_of_earliest, | ||
arrayFirstIndex(x -> tupleElement(x, 1) == max_timestamp, tuple_array) as index_of_latest, | ||
tupleElement(arrayElement( | ||
tuple_array, | ||
index_of_earliest | ||
), 2) AS earliest_referrer, | ||
tupleElement(arrayElement( | ||
tuple_array, | ||
index_of_earliest | ||
), 3) AS earliest_pathname, | ||
tupleElement(arrayElement( | ||
tuple_array, | ||
index_of_earliest | ||
), 4) AS earliest_utm_source, | ||
|
||
if(domain(earliest_referrer) = '', earliest_referrer, domain(earliest_referrer)) AS referrer_domain, | ||
multiIf( | ||
earliest_utm_source IS NOT NULL, earliest_utm_source, | ||
-- This will need to be an approach that scales better | ||
referrer_domain == 'app.posthog.com', 'posthog', | ||
referrer_domain == 'eu.posthog.com', 'posthog', | ||
referrer_domain == 'posthog.com', 'posthog', | ||
referrer_domain == 'www.google.com', 'google', | ||
referrer_domain == 'www.google.co.uk', 'google', | ||
referrer_domain == 'www.google.com.hk', 'google', | ||
referrer_domain == 'www.google.de', 'google', | ||
referrer_domain == 't.co', 'twitter', | ||
referrer_domain == 'github.com', 'github', | ||
referrer_domain == 'duckduckgo.com', 'duckduckgo', | ||
referrer_domain == 'www.bing.com', 'bing', | ||
referrer_domain == 'bing.com', 'bing', | ||
referrer_domain == 'yandex.ru', 'yandex', | ||
referrer_domain == 'quora.com', 'quora', | ||
referrer_domain == 'www.quora.com', 'quora', | ||
referrer_domain == 'linkedin.com', 'linkedin', | ||
referrer_domain == 'www.linkedin.com', 'linkedin', | ||
startsWith(referrer_domain, 'http://localhost:'), 'localhost', | ||
referrer_domain | ||
) AS blended_source, | ||
|
||
countIf(events.event == '$pageview') AS num_pageviews, | ||
countIf(events.event == '$autocapture') AS num_autocaptures, | ||
-- in v1 we'd also want to count whether there were any conversion events | ||
|
||
any(events.person_id) as person_id, | ||
-- definition of a GA4 bounce from here https://support.google.com/analytics/answer/12195621?hl=en | ||
(num_autocaptures == 0 AND num_pageviews <= 1 AND duration_s < 10) AS is_bounce | ||
FROM | ||
events | ||
WHERE | ||
session_id IS NOT NULL | ||
AND | ||
events.timestamp >= now() - INTERVAL 8 DAY | ||
GROUP BY | ||
events.properties.`$session_id` | ||
HAVING | ||
min_timestamp >= now() - INTERVAL 7 DAY | ||
) | ||
|
||
|
||
|
||
SELECT | ||
blended_source, | ||
count(num_pageviews) as total_pageviews, | ||
count(DISTINCT person_id) as unique_visitors, | ||
avg(is_bounce) AS bounce_rate | ||
FROM | ||
session_cte | ||
WHERE | ||
blended_source IS NOT NULL | ||
GROUP BY blended_source | ||
|
||
ORDER BY total_pageviews DESC | ||
LIMIT 100 | ||
""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These long queries look like something that takes a long time for HogQL to parse 🙈. 11+sec for me locally. Hopefully the C++ parser will solve this.
class WebAnalyticsQueryRunner(QueryRunner, ABC): | ||
def _is_stale(self, cached_result_package): | ||
return True | ||
|
||
def _refresh_frequency(self): | ||
return BASE_MINIMUM_INSIGHT_REFRESH_INTERVAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of scope / random thought: if we're going to add many queries, these should also get defaults in the base runner, and be overidden on a as-needed basis.
@@ -6,38 +6,38 @@ import { NodeKind } from '~/queries/schema' | |||
export function WebAnalyticsScene(): JSX.Element { | |||
return ( | |||
<div> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest adding a <div>
tag around all the titles, and replacing this line with
<div> | |
<div className='space-y-2'> |
or so.
Check out the https://storybook.posthog.net/ for more style guidelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm intentionally not including any styling changes in this PR - that'll be a separate PR. The text here is just for my benefit as I develop it.
from posthog.schema import WebTopPagesQuery, WebTopPagesQueryResponse | ||
|
||
|
||
class TopPagesQueryRunner(WebAnalyticsQueryRunner): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something with this query is not working for me. I'm getting a Illegal type String of argument for aggregate function avg.
error. A quick check locally under data management shows that I don't have the $prev_pageview_max_scroll_percentage
property at all, so it's likely treated as a string, not an integer. I wouldn't consider this blocking for an alpha.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have the same problem, I'll need to figure this out before we roll it out to new people. I naively added a toFloat
around this, which didn't work either, but I'm sure there must be something I can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly a very naive toString -> toFloat might work?
The alternative is to make sure that the required property definitions exist with the right types, when loading the page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The answer was JSONExtractRaw -> toFloat, will be in a separate PR
5dd664d
to
2225280
Compare
Co-authored-by: Marius Andra <[email protected]>
80d0191
to
9fa716a
Compare
9fa716a
to
9698c04
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last nits, non blocking.
class TestSubclassQueryRunner(TestQueryRunner): # type: ignore | ||
pass | ||
|
||
team = Team.objects.create(pk=42, organization=self.organization) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know when and where we're running these tests. pk=42
might already exist. Why not leave it empty and get an autoassigned one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment. The first one of these tests already existed, I'd just copied it and adapted it for a slightly different case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair. I guess the ID is hardcoded to have a stable cache key. So all ok 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my last nit here is that if every other file in the hogql_queries
folder follows the "filename is an underscore version of the file's main class's name", then what excuse do these three files have to be different? "Not enough disk space" is not a right answer. "It was faster to write it if it's shorter" is also the wrong answer.
A good book I recommend to people with regards to website usability is called "Don't make me think". The main premise is that if you force your users to stop and think, breaking them out of an intuitive flow, your UX is bad. Same here: I need to think why these are different.
This is not blocking, but just causes me to stop and 🤔. Best case we ignore this, worse case, someone 6 months from now, will also take time to ponder the inconsistency, and then spend time making a PR to rename them, and will take someone else's time to review the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reasoning was
- Can parse it visually very easily. Something I didn't like before was that query_runner, lifecycle_query_runner, insight_query_runner, etc, all looked very similar and it wasn't immediate which file was the special one that contained the base class, vs all of the files that contained a subclass
- It's already in the web analytics folder, so the filename doesn't need to include the words web analytics
- It's useful for the class names to keep the words web analytics, because they're used outside of this folder, and it provides context to the person reading the code that uses them.
Problem
Follow on from #17626
Changes
Move the Top Sources query to the back end
How did you test this code?
(Behind FF) ran it manually