Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(web-analytics): Add virtual hogql column for what type of traffic (organic search, paid social, etc) a person is #19085

Merged
merged 24 commits into from
Dec 14, 2023

Conversation

robbie-c
Copy link
Member

@robbie-c robbie-c commented Dec 5, 2023

Problem

Customers would like to know broadly what category their traffic comes under

#18713

Merge PostHog/posthog-js#934 alongside this

Changes

Add a virtual column for referring domain type, and also channel type

How did you test this code?

Wrote tests

@robbie-c robbie-c changed the title Add referring domain type feat(web-analytics): Add referring domain type Dec 5, 2023
@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 1 modified, 0 deleted (diff for shard 1)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@robbie-c robbie-c requested a review from mariusandra December 5, 2023 12:12
@robbie-c robbie-c force-pushed the add-traffic-source branch 3 times, most recently from 0da6f75 to 336acaa Compare December 5, 2023 12:48
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had an initial look through and left some comments. Didn't have time to run it nor check the resolver error yet.

package.json Outdated
@@ -49,6 +49,7 @@
"editor:update-tsd": "pnpm packages:build && node frontend/editor-update-tsd.mjs",
"prettier": "prettier --write \"./**/*.{js,mjs,ts,tsx,json,yaml,yml,css,scss}\"",
"prettier:check": "prettier --check \"frontend/**/*.{js,mjs,ts,tsx,json,yaml,yml,css,scss}\"",
"prettier:file": "prettier --write",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably easier to write prettier --write directly, as nobody else will remember or know of this command, but 🤷

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be a package.json script, otherwise prettier won't be on the path. The way I usually do this would be to rename the prettier script to prettier:fix, leaving prettier to just be the raw tool. Then you can do yarn run prettier --write some/file.txt. I didn't do this because I didn't want to touch more things in this PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

npx prettier --write works well in case your path doesn't automatically include the local scripts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to this in either pnpm or yarn, if you run pnpm prettier, it's exactly as if prettier was in $PATH. In this case pnpm prettier --write should 100% be equivalent to pnpm prettier:file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's normally true, but if you define a package script prettier (which we do on line 50 of package.json) then that takes precendence over just running prettier from .bin.

Just tested this
Screenshot 2023-12-12 at 10 06 21
Screenshot 2023-12-12 at 10 06 51

npx works, I don't usually like using it like this because by default it will install any missing dependencies, meaning that if you do have a bug with your dependencies it can make it harder to track down.

posthog/clickhouse/test/__snapshots__/test_schema.ambr Outdated Show resolved Hide resolved
posthog/hogql/database/schema/channel_type.py Outdated Show resolved Hide resolved
posthog/hogql/database/schema/channel_type.py Outdated Show resolved Hide resolved
Comment on lines 571 to 574
# dict
"dictGet": HogQLFunctionMeta("dictGet", 3, 3),
"dictGetOrNull": HogQLFunctionMeta("dictGetOrNull", 3, 3),
"dictGetOrDefault": HogQLFunctionMeta("dictGetOrDefault", 4, 4),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any dict-s that we don't want the user accessing? I think these functions should get an extra sanity check inside them (in the printer?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm potentially, another way would be for this to be a custom function, e.g. getDomainType or similar, and we don't expose dicts to the user

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could work, but then we're exposing another internal method externally, which also isn't that ideal. E.g. what if ClickHouse one day also implements a method with the same name. It'll be JavaScript ProtoType hell all over again 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are gone, I now have custom hogql functions with a hogql_ prefix

posthog/management/commands/clean_ga4_sources_file.py Outdated Show resolved Hide resolved
posthog/models/channel_type/sql.py Outdated Show resolved Hide resolved
Comment on lines 171 to 174
database.persons.fields["$initial_referring_domain_type"] = create_initial_domain_type(
"$initial_referring_domain_type"
)
database.persons.fields["$initial_channel_type"] = create_initial_channel_type("$initial_channel_type")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also now wonder if we want the $ here. It looks like a property, but isn't. Perhaps we should use some other way of namespacing these? __initial_... or such? I can imagine there will be a bit of confusion if it looks like a property, but you can't find it under properties, for e.g. filters and such.

Alternatively, maybe we do want to bring forward the idea of actual calculated properties, property schemas (type verification), and so forth? This way we could expose original_properties, and have a new magical properties column that "inherits" from it?

I realise this is a textbook example of scope creep though, and definitely not blocking this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed - I used $virt to prefix the virtual columns

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, now that these don't appear on asterisk expansions, I think we're less likely to run into accidental confusion

@robbie-c robbie-c force-pushed the add-traffic-source branch 3 times, most recently from bc6e793 to de175c3 Compare December 11, 2023 08:33
@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 0 modified, 0 deleted
  • webkit: 0 added, 1 modified, 0 deleted (diff for shard 2)

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 0 modified, 0 deleted
  • webkit: 0 added, 1 modified, 0 deleted (diff for shard 2)

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Copy link
Contributor

Size Change: 0 B

Total Size: 1.83 MB

ℹ️ View Unchanged
Filename Size
frontend/dist/toolbar.js 1.83 MB

compressed-size-action

Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, though I'd also have another set of eyes on the ClickHouse functions, as requested by the PR policy... 👍

Comment on lines +19 to +97
def create_initial_domain_type(name: str):
return ExpressionField(
name=name,
expr=parse_expr(
"""
if(
properties.$initial_referring_domain = '$direct',
'$direct',
hogql_lookupDomainType(properties.$initial_referring_domain)
)
"""
),
)


def create_initial_channel_type(name: str):
return ExpressionField(
name=name,
expr=parse_expr(
"""
multiIf(
match(properties.$initial_utm_campaign, 'cross-network'),
'Cross Network',

(
match(properties.$initial_utm_medium, '^(.*cp.*|ppc|retargeting|paid.*)$') OR
properties.$initial_gclid IS NOT NULL OR
properties.$initial_gad_source IS NOT NULL
),
coalesce(
hogql_lookupPaidSourceType(properties.$initial_utm_source),
hogql_lookupPaidDomainType(properties.$initial_referring_domain),
if(
match(properties.$initial_utm_campaign, '^(.*(([^a-df-z]|^)shop|shopping).*)$'),
'Paid Shopping',
NULL
),
hogql_lookupPaidMediumType(properties.$initial_utm_medium),
multiIf (
properties.$initial_gad_source = '1',
'Paid Search',

match(properties.$initial_utm_campaign, '^(.*video.*)$'),
'Paid Video',

'Paid Other'
)
),

(
properties.$initial_referring_domain = '$direct'
AND (properties.$initial_utm_medium IS NULL OR properties.$initial_utm_medium = '')
AND (properties.$initial_utm_source IS NULL OR properties.$initial_utm_source IN ('', '(direct)', 'direct'))
),
'Direct',

coalesce(
hogql_lookupOrganicSourceType(properties.$initial_utm_source),
hogql_lookupOrganicDomainType(properties.$initial_referring_domain),
if(
match(properties.$initial_utm_campaign, '^(.*(([^a-df-z]|^)shop|shopping).*)$'),
'Organic Shopping',
NULL
),
hogql_lookupOrganicMediumType(properties.$initial_utm_medium),
multiIf(
match(properties.$initial_utm_campaign, '^(.*video.*)$'),
'Organic Video',

match(properties.$initial_utm_medium, 'push$'),
'Push',

NULL
)
)
)""",
start=None,
),
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can spot one room for improvement. Now if we create a database, for whichever SQL query, we will need to parse_expr this SQL string when we add these fields onto the events table (every time we create a database). Our new parser is fast, but this may add some bunch of milliseconds to every query. Perhaps worth measuring, and/or inlining? Alternatively, ExpressionField could/should just support strings, and do the parse_expr when you're actually using the field.

I don't think this is blocking for this PR, but just a good fix to do at one point 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking if it would make sense to move all this SQL into a clickhouse function in the future anyway: hogql_getChannelType(referringDomain, utmSource, utmDomain, gclid, gadSource, fbclid)
Then this virtual column just becomes
hogql_getChannelType(properties.$initial_referring_domain, ...etc)
but I could also use it to attributes a specific session instead. That might reduce the amount of parsing needed. Inlining it or passing a string would also work.

@robbie-c robbie-c marked this pull request as ready for review December 11, 2023 12:44
@robbie-c robbie-c requested a review from fuziontech as a code owner December 11, 2023 12:44
@robbie-c robbie-c requested a review from Twixes December 11, 2023 12:44
@robbie-c robbie-c changed the title feat(web-analytics): Add referring domain type feat(web-analytics): Add virtual hogql column for what type of traffic (organic search, paid social, etc) a person is Dec 11, 2023
@robbie-c robbie-c enabled auto-merge (squash) December 11, 2023 14:38
Comment on lines +24 to +29
if(
properties.$initial_referring_domain = '$direct',
'$direct',
hogql_lookupDomainType(properties.$initial_referring_domain)
)
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a bit more readable with indentation

Comment on lines +745 to +751
"hogql_lookupDomainType": HogQLFunctionMeta("hogql_lookupDomainType", 1, 1),
"hogql_lookupPaidDomainType": HogQLFunctionMeta("hogql_lookupPaidDomainType", 1, 1),
"hogql_lookupPaidSourceType": HogQLFunctionMeta("hogql_lookupPaidSourceType", 1, 1),
"hogql_lookupPaidMediumType": HogQLFunctionMeta("hogql_lookupPaidMediumType", 1, 1),
"hogql_lookupOrganicDomainType": HogQLFunctionMeta("hogql_lookupOrganicDomainType", 1, 1),
"hogql_lookupOrganicSourceType": HogQLFunctionMeta("hogql_lookupOrganicSourceType", 1, 1),
"hogql_lookupOrganicMediumType": HogQLFunctionMeta("hogql_lookupOrganicMediumType", 1, 1),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hogql_ prefix is a new convention, why do this instead of a plain function name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to avoid namespace collisions with clickhouse functions. Probably not likely to happen with these functions specifically but something we might run into in the future

Comment on lines +847 to +863
return f"dictGetOrNull('channel_definition_dict', 'domain_type', (cutToFirstSignificantSubdomain(coalesce({args[0]}, '')), 'source'))"
elif node.name == "hogql_lookupPaidDomainType":
return f"dictGetOrNull('channel_definition_dict', 'type_if_paid', (cutToFirstSignificantSubdomain(coalesce({args[0]}, '')), 'source'))"
elif node.name == "hogql_lookupPaidSourceType":
return (
f"dictGetOrNull('channel_definition_dict', 'type_if_paid', (coalesce({args[0]}, ''), 'source'))"
)
elif node.name == "hogql_lookupPaidMediumType":
return (
f"dictGetOrNull('channel_definition_dict', 'type_if_paid', (coalesce({args[0]}, ''), 'medium'))"
)
elif node.name == "hogql_lookupOrganicDomainType":
return f"dictGetOrNull('channel_definition_dict', 'type_if_organic', (cutToFirstSignificantSubdomain(coalesce({args[0]}, '')), 'source'))"
elif node.name == "hogql_lookupOrganicSourceType":
return f"dictGetOrNull('channel_definition_dict', 'type_if_organic', (coalesce({args[0]}, ''), 'source'))"
elif node.name == "hogql_lookupOrganicMediumType":
return f"dictGetOrNull('channel_definition_dict', 'type_if_organic', (coalesce({args[0]}, ''), 'medium'))"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much more like resolution of a function into its expanded form (i.e. resolver task) rather than just printing of the AST (a printer task). It would be safer to do this as part of AST resolution, as well as easier to maintain, as this method here is already awfully complex – hence I suggest we put this into the resolver's visit_call method. @mariusandra What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was that we don't want to expose dictGetOrNull as a method to users, as we don't know what dicts there are, or will be. Someone might YOLO something in a cluster, and all of a sudden users can access a dict with sensitive data.

)
PRIMARY KEY domain, kind
SOURCE(CLICKHOUSE(TABLE 'channel_definition'))
LIFETIME(MIN 1 MAX 3600)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the lifetime work? Curious why these values

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dictionary lives in memory, based off the clickhouse table. This is the min and max interval between updating the memory version of the dictionary. You have to provide a range because if you might have many clickhouse nodes and it would like to avoid hammering the source with all the nodes at the same time.

This data is very rarely updated (the source google doc is updated approximately once a year), so I just picked a large number with maximum flexibility. Not too large, as we don't want to have to wait half a day for changes to propagate, but one hour seemed fine, and we can run a manual command to update it if we need to.

It might not be a bad idea to increase the min, hopefully clickhouse is sensible and doesn't try to update every second, but setting it to 3000 would be safer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 27 to 34
class Command(BaseCommand):
help = (
"Write the channel_definitions.json file. Needs a ga4 sources file like"
"https://storage.googleapis.com/support-kms-prod/qn1xhBu8MVcZPIZ2WZMNdI40FtZXFPGYxj2K"
"as input. The best way I have found to do this is to open it in Google Docs, then save it as a text file"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often should this channel_definitions.json update be done? Thinking about long-term maintainability of this mechanism

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About once a year according to google, though we might want to update more frequently than that if we get requests from customers to add or change something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote this in the comment :)

Copy link
Member

@fuziontech fuziontech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The channel_definition table should probably be ReplicatedMergeTree if we want it to sync data between all nodes. Also curious if we have a race condition between loading the source table here, replication, and then loading of the dictionary 🤔


operations = [
run_sql_with_exceptions(CHANNEL_DEFINITION_TABLE_SQL),
run_sql_with_exceptions(CHANNEL_DEFINITIONS_DATA_SQL),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally a nit - feel free to ignore. You have plural of DEFINITIONS and singular of DEFINITION here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

type_if_paid String NULL,
type_if_organic String NULL,
) ENGINE = MergeTree()
ORDER BY domain;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would sync the ORDER BY key for this table with the PRIMARY KEY of the dictionary below.
https://github.com/PostHog/posthog/pull/19085/files#diff-35d4b57e1a6f6bcfa90872805e15f38daaf02e2d0f99045832e3a08a50d6f5c0R51

Looks like it should be

ORDER BY domain, kind

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

domain_type String NULL,
type_if_paid String NULL,
type_if_organic String NULL,
) ENGINE = MergeTree()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table should probably be ReplicatedMergeTree right? We don't ensure that migrations are run directly on every node (why we use ON CLUSTER to have the coordinating node pass the DDLs onto each member of the cluster) so when this migration runs you will only load the data onto one node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 43 to 65
CHANNEL_DEFINITION_DICTIONARY_SQL = """
CREATE DICTIONARY IF NOT EXISTS channel_definition_dict (
domain String,
kind String,
domain_type Nullable(String),
type_if_paid Nullable(String),
type_if_organic Nullable(String)
)
PRIMARY KEY domain, kind
SOURCE(CLICKHOUSE(TABLE 'channel_definition'))
LIFETIME(MIN 3000 MAX 3600)
LAYOUT(COMPLEX_KEY_HASHED())
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice! Curious to see how much faster this is than just a table. Having both we can do a bit of benchmarking.

Copy link
Member

@fuziontech fuziontech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 it!

@robbie-c robbie-c merged commit 9fa5c7b into master Dec 14, 2023
72 checks passed
@robbie-c robbie-c deleted the add-traffic-source branch December 14, 2023 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants