-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(web-analytics): Add virtual hogql column for what type of traffic (organic search, paid social, etc) a person is #19085
Conversation
📸 UI snapshots have been updated1 snapshot changes in total. 0 added, 1 modified, 0 deleted:
Triggered by this commit. |
70aec9a
to
6a8d92b
Compare
0da6f75
to
336acaa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had an initial look through and left some comments. Didn't have time to run it nor check the resolver error yet.
package.json
Outdated
@@ -49,6 +49,7 @@ | |||
"editor:update-tsd": "pnpm packages:build && node frontend/editor-update-tsd.mjs", | |||
"prettier": "prettier --write \"./**/*.{js,mjs,ts,tsx,json,yaml,yml,css,scss}\"", | |||
"prettier:check": "prettier --check \"frontend/**/*.{js,mjs,ts,tsx,json,yaml,yml,css,scss}\"", | |||
"prettier:file": "prettier --write", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's probably easier to write prettier --write
directly, as nobody else will remember or know of this command, but 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to be a package.json script, otherwise prettier won't be on the path. The way I usually do this would be to rename the prettier
script to prettier:fix
, leaving prettier
to just be the raw tool. Then you can do yarn run prettier --write some/file.txt
. I didn't do this because I didn't want to touch more things in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
npx prettier --write
works well in case your path doesn't automatically include the local scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to this in either pnpm
or yarn
, if you run pnpm prettier
, it's exactly as if prettier
was in $PATH
. In this case pnpm prettier --write
should 100% be equivalent to pnpm prettier:file
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's normally true, but if you define a package script prettier
(which we do on line 50 of package.json) then that takes precendence over just running prettier
from .bin
.
npx
works, I don't usually like using it like this because by default it will install any missing dependencies, meaning that if you do have a bug with your dependencies it can make it harder to track down.
posthog/hogql/functions/mapping.py
Outdated
# dict | ||
"dictGet": HogQLFunctionMeta("dictGet", 3, 3), | ||
"dictGetOrNull": HogQLFunctionMeta("dictGetOrNull", 3, 3), | ||
"dictGetOrDefault": HogQLFunctionMeta("dictGetOrDefault", 4, 4), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any dict
-s that we don't want the user accessing? I think these functions should get an extra sanity check inside them (in the printer?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm potentially, another way would be for this to be a custom function, e.g. getDomainType
or similar, and we don't expose dicts to the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could work, but then we're exposing another internal method externally, which also isn't that ideal. E.g. what if ClickHouse one day also implements a method with the same name. It'll be JavaScript ProtoType hell all over again 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are gone, I now have custom hogql functions with a hogql_
prefix
posthog/hogql/database/database.py
Outdated
database.persons.fields["$initial_referring_domain_type"] = create_initial_domain_type( | ||
"$initial_referring_domain_type" | ||
) | ||
database.persons.fields["$initial_channel_type"] = create_initial_channel_type("$initial_channel_type") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also now wonder if we want the $
here. It looks like a property, but isn't. Perhaps we should use some other way of namespacing these? __initial_...
or such? I can imagine there will be a bit of confusion if it looks like a property, but you can't find it under properties, for e.g. filters and such.
Alternatively, maybe we do want to bring forward the idea of actual calculated properties, property schemas (type verification), and so forth? This way we could expose original_properties
, and have a new magical properties
column that "inherits" from it?
I realise this is a textbook example of scope creep though, and definitely not blocking this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed - I used $virt to prefix the virtual columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, now that these don't appear on asterisk expansions, I think we're less likely to run into accidental confusion
bc6e793
to
de175c3
Compare
📸 UI snapshots have been updated1 snapshot changes in total. 0 added, 1 modified, 0 deleted:
Triggered by this commit. |
📸 UI snapshots have been updated1 snapshot changes in total. 0 added, 1 modified, 0 deleted:
Triggered by this commit. |
Size Change: 0 B Total Size: 1.83 MB ℹ️ View Unchanged
|
1201a14
to
ceb2bd0
Compare
d9148e7
to
7587d25
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, though I'd also have another set of eyes on the ClickHouse functions, as requested by the PR policy... 👍
def create_initial_domain_type(name: str): | ||
return ExpressionField( | ||
name=name, | ||
expr=parse_expr( | ||
""" | ||
if( | ||
properties.$initial_referring_domain = '$direct', | ||
'$direct', | ||
hogql_lookupDomainType(properties.$initial_referring_domain) | ||
) | ||
""" | ||
), | ||
) | ||
|
||
|
||
def create_initial_channel_type(name: str): | ||
return ExpressionField( | ||
name=name, | ||
expr=parse_expr( | ||
""" | ||
multiIf( | ||
match(properties.$initial_utm_campaign, 'cross-network'), | ||
'Cross Network', | ||
|
||
( | ||
match(properties.$initial_utm_medium, '^(.*cp.*|ppc|retargeting|paid.*)$') OR | ||
properties.$initial_gclid IS NOT NULL OR | ||
properties.$initial_gad_source IS NOT NULL | ||
), | ||
coalesce( | ||
hogql_lookupPaidSourceType(properties.$initial_utm_source), | ||
hogql_lookupPaidDomainType(properties.$initial_referring_domain), | ||
if( | ||
match(properties.$initial_utm_campaign, '^(.*(([^a-df-z]|^)shop|shopping).*)$'), | ||
'Paid Shopping', | ||
NULL | ||
), | ||
hogql_lookupPaidMediumType(properties.$initial_utm_medium), | ||
multiIf ( | ||
properties.$initial_gad_source = '1', | ||
'Paid Search', | ||
|
||
match(properties.$initial_utm_campaign, '^(.*video.*)$'), | ||
'Paid Video', | ||
|
||
'Paid Other' | ||
) | ||
), | ||
|
||
( | ||
properties.$initial_referring_domain = '$direct' | ||
AND (properties.$initial_utm_medium IS NULL OR properties.$initial_utm_medium = '') | ||
AND (properties.$initial_utm_source IS NULL OR properties.$initial_utm_source IN ('', '(direct)', 'direct')) | ||
), | ||
'Direct', | ||
|
||
coalesce( | ||
hogql_lookupOrganicSourceType(properties.$initial_utm_source), | ||
hogql_lookupOrganicDomainType(properties.$initial_referring_domain), | ||
if( | ||
match(properties.$initial_utm_campaign, '^(.*(([^a-df-z]|^)shop|shopping).*)$'), | ||
'Organic Shopping', | ||
NULL | ||
), | ||
hogql_lookupOrganicMediumType(properties.$initial_utm_medium), | ||
multiIf( | ||
match(properties.$initial_utm_campaign, '^(.*video.*)$'), | ||
'Organic Video', | ||
|
||
match(properties.$initial_utm_medium, 'push$'), | ||
'Push', | ||
|
||
NULL | ||
) | ||
) | ||
)""", | ||
start=None, | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can spot one room for improvement. Now if we create a database, for whichever SQL query, we will need to parse_expr
this SQL string when we add these fields onto the events table (every time we create a database). Our new parser is fast, but this may add some bunch of milliseconds to every query. Perhaps worth measuring, and/or inlining? Alternatively, ExpressionField
could/should just support strings, and do the parse_expr
when you're actually using the field.
I don't think this is blocking for this PR, but just a good fix to do at one point 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking if it would make sense to move all this SQL into a clickhouse function in the future anyway: hogql_getChannelType(referringDomain, utmSource, utmDomain, gclid, gadSource, fbclid)
Then this virtual column just becomes
hogql_getChannelType(properties.$initial_referring_domain, ...etc)
but I could also use it to attributes a specific session instead. That might reduce the amount of parsing needed. Inlining it or passing a string would also work.
953f7a0
to
c31ec03
Compare
if( | ||
properties.$initial_referring_domain = '$direct', | ||
'$direct', | ||
hogql_lookupDomainType(properties.$initial_referring_domain) | ||
) | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a bit more readable with indentation
"hogql_lookupDomainType": HogQLFunctionMeta("hogql_lookupDomainType", 1, 1), | ||
"hogql_lookupPaidDomainType": HogQLFunctionMeta("hogql_lookupPaidDomainType", 1, 1), | ||
"hogql_lookupPaidSourceType": HogQLFunctionMeta("hogql_lookupPaidSourceType", 1, 1), | ||
"hogql_lookupPaidMediumType": HogQLFunctionMeta("hogql_lookupPaidMediumType", 1, 1), | ||
"hogql_lookupOrganicDomainType": HogQLFunctionMeta("hogql_lookupOrganicDomainType", 1, 1), | ||
"hogql_lookupOrganicSourceType": HogQLFunctionMeta("hogql_lookupOrganicSourceType", 1, 1), | ||
"hogql_lookupOrganicMediumType": HogQLFunctionMeta("hogql_lookupOrganicMediumType", 1, 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hogql_
prefix is a new convention, why do this instead of a plain function name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to avoid namespace collisions with clickhouse functions. Probably not likely to happen with these functions specifically but something we might run into in the future
return f"dictGetOrNull('channel_definition_dict', 'domain_type', (cutToFirstSignificantSubdomain(coalesce({args[0]}, '')), 'source'))" | ||
elif node.name == "hogql_lookupPaidDomainType": | ||
return f"dictGetOrNull('channel_definition_dict', 'type_if_paid', (cutToFirstSignificantSubdomain(coalesce({args[0]}, '')), 'source'))" | ||
elif node.name == "hogql_lookupPaidSourceType": | ||
return ( | ||
f"dictGetOrNull('channel_definition_dict', 'type_if_paid', (coalesce({args[0]}, ''), 'source'))" | ||
) | ||
elif node.name == "hogql_lookupPaidMediumType": | ||
return ( | ||
f"dictGetOrNull('channel_definition_dict', 'type_if_paid', (coalesce({args[0]}, ''), 'medium'))" | ||
) | ||
elif node.name == "hogql_lookupOrganicDomainType": | ||
return f"dictGetOrNull('channel_definition_dict', 'type_if_organic', (cutToFirstSignificantSubdomain(coalesce({args[0]}, '')), 'source'))" | ||
elif node.name == "hogql_lookupOrganicSourceType": | ||
return f"dictGetOrNull('channel_definition_dict', 'type_if_organic', (coalesce({args[0]}, ''), 'source'))" | ||
elif node.name == "hogql_lookupOrganicMediumType": | ||
return f"dictGetOrNull('channel_definition_dict', 'type_if_organic', (coalesce({args[0]}, ''), 'medium'))" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks much more like resolution of a function into its expanded form (i.e. resolver task) rather than just printing of the AST (a printer task). It would be safer to do this as part of AST resolution, as well as easier to maintain, as this method here is already awfully complex – hence I suggest we put this into the resolver's visit_call
method. @mariusandra What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was that we don't want to expose dictGetOrNull
as a method to users, as we don't know what dicts there are, or will be. Someone might YOLO something in a cluster, and all of a sudden users can access a dict with sensitive data.
posthog/models/channel_type/sql.py
Outdated
) | ||
PRIMARY KEY domain, kind | ||
SOURCE(CLICKHOUSE(TABLE 'channel_definition')) | ||
LIFETIME(MIN 1 MAX 3600) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the lifetime work? Curious why these values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dictionary lives in memory, based off the clickhouse table. This is the min and max interval between updating the memory version of the dictionary. You have to provide a range because if you might have many clickhouse nodes and it would like to avoid hammering the source with all the nodes at the same time.
This data is very rarely updated (the source google doc is updated approximately once a year), so I just picked a large number with maximum flexibility. Not too large, as we don't want to have to wait half a day for changes to propagate, but one hour seemed fine, and we can run a manual command to update it if we need to.
It might not be a bad idea to increase the min, hopefully clickhouse is sensible and doesn't try to update every second, but setting it to 3000 would be safer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
class Command(BaseCommand): | ||
help = ( | ||
"Write the channel_definitions.json file. Needs a ga4 sources file like" | ||
"https://storage.googleapis.com/support-kms-prod/qn1xhBu8MVcZPIZ2WZMNdI40FtZXFPGYxj2K" | ||
"as input. The best way I have found to do this is to open it in Google Docs, then save it as a text file" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How often should this channel_definitions.json
update be done? Thinking about long-term maintainability of this mechanism
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About once a year according to google, though we might want to update more frequently than that if we get requests from customers to add or change something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this in the comment :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The channel_definition
table should probably be ReplicatedMergeTree
if we want it to sync data between all nodes. Also curious if we have a race condition between loading the source table here, replication, and then loading of the dictionary 🤔
|
||
operations = [ | ||
run_sql_with_exceptions(CHANNEL_DEFINITION_TABLE_SQL), | ||
run_sql_with_exceptions(CHANNEL_DEFINITIONS_DATA_SQL), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally a nit - feel free to ignore. You have plural of DEFINITIONS
and singular of DEFINITION
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
posthog/models/channel_type/sql.py
Outdated
type_if_paid String NULL, | ||
type_if_organic String NULL, | ||
) ENGINE = MergeTree() | ||
ORDER BY domain; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would sync the ORDER BY
key for this table with the PRIMARY KEY
of the dictionary below.
https://github.com/PostHog/posthog/pull/19085/files#diff-35d4b57e1a6f6bcfa90872805e15f38daaf02e2d0f99045832e3a08a50d6f5c0R51
Looks like it should be
ORDER BY domain, kind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
posthog/models/channel_type/sql.py
Outdated
domain_type String NULL, | ||
type_if_paid String NULL, | ||
type_if_organic String NULL, | ||
) ENGINE = MergeTree() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This table should probably be ReplicatedMergeTree
right? We don't ensure that migrations are run directly on every node (why we use ON CLUSTER
to have the coordinating node pass the DDLs onto each member of the cluster) so when this migration runs you will only load the data onto one node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
posthog/models/channel_type/sql.py
Outdated
CHANNEL_DEFINITION_DICTIONARY_SQL = """ | ||
CREATE DICTIONARY IF NOT EXISTS channel_definition_dict ( | ||
domain String, | ||
kind String, | ||
domain_type Nullable(String), | ||
type_if_paid Nullable(String), | ||
type_if_organic Nullable(String) | ||
) | ||
PRIMARY KEY domain, kind | ||
SOURCE(CLICKHOUSE(TABLE 'channel_definition')) | ||
LIFETIME(MIN 3000 MAX 3600) | ||
LAYOUT(COMPLEX_KEY_HASHED()) | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice! Curious to see how much faster this is than just a table. Having both we can do a bit of benchmarking.
f373cd1
to
1722dad
Compare
0ca2c27
to
86e98f9
Compare
fa70e10
to
2f23081
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢 it!
Problem
Customers would like to know broadly what category their traffic comes under
#18713
Merge PostHog/posthog-js#934 alongside this
Changes
Add a virtual column for referring domain type, and also channel type
How did you test this code?
Wrote tests