Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support New User ID Tables in GA4 BigQuery Export #205

Open
dgitis opened this issue May 23, 2023 · 6 comments
Open

Support New User ID Tables in GA4 BigQuery Export #205

dgitis opened this issue May 23, 2023 · 6 comments

Comments

@dgitis
Copy link
Collaborator

dgitis commented May 23, 2023

Google is adding new fields and tables to the BigQuery export. Here's the announcement from Google IO.

The session-level attribution levels only remove the requirement to unnest data that we're already unnesting. However, they will be adding a new table for user_id data that will contain data for all users whose data changed in the export period (presumably this will be a partition).

For now, there's nothing that we can do without seeing the export in the wild, but I suggest we put a freeze on major changes to user data.

@willbryant @adamribaudo-velir

@dgitis
Copy link
Collaborator Author

dgitis commented May 29, 2023

Google has changed their documentation and have added the new session-level attribution columns to the BigQuery export.

The collected_traffic_source record is new and holds all of the fields highlighted in that Google IO announcement. I can't find an example where those fields are being populated.

My big question is whether these fields are being set for all events or for just the first batch. I presume it will be all events otherwise how does it differ from the existing source, medium, and gclid event parameters? However, this being Google, I don't want to count on it being how I expect.

@adamribaudo-velir
Copy link
Collaborator

Thanks for the heads up. Definitely something to monitor. We'll also have to decide whether to switch to this method or support both the legacy and new method to account for data collected before this change 🤔

@dgitis
Copy link
Collaborator Author

dgitis commented Jun 12, 2023

The collected_traffic_source fields are now populating. They seem to populate all events in except session_start and first_visit.

Here's a session with multiple timestamps that I found.

image

@dgitis
Copy link
Collaborator Author

dgitis commented Jun 30, 2023

I'd been hearing that the reality of the collected_traffic_sources is different from what I thought after my previous comment and I just verified that here is how these fields get populated.

The collected_traffic_sources fields get populated on all events that happen on the first page of a session except for first_visit and session_start. It is not the first batch events, as I disproved above.

Events on subsequent pages do not propagate the collected_traffic_sources fields.

Basically, these session fields only save us from unnesting the data.

Where session data is concerned, I consider this closed.

Google still plans on adding User-Id export to a completely separate table. I'll be updating the title of this to specify User-Id tables.

@dgitis dgitis changed the title Support New Fields and Tables in GA4 BigQuery Export Support New User ID Tables in GA4 BigQuery Export Jun 30, 2023
@adamribaudo-velir
Copy link
Collaborator

User ID & Pseudo User ID tables are documented here: https://support.google.com/analytics/answer/12769371

Just enabled the export on my company's GA4 instance so I hope to return to this soon.

@dgitis
Copy link
Collaborator Author

dgitis commented Apr 1, 2024

I checked this a couple of weeks ago and the daily user counts on these tables (user_is and user_pseudo_id) closely match the user counts on our equivalent tables.

These tables are substantially different from our user tables so I think the best path is to copy these into partitioned dim tables and have two separate sets of user tables with very different fields.

The other option is to enhance these new tables with the fields that we currently have in our old tables.

I'd rather be able to disable the old tables on high-volume projects because they aren't partitioned and many of the fields, like the first_* and last_* don't get used all that often.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants