Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for Malformed Events with page_location = '/' in fct_ga4__pages #311

Open
erikverheij opened this issue Mar 23, 2024 · 9 comments

Comments

@erikverheij
Copy link
Contributor

I've encountered an issue with some events in the base_ga4__events model where page_location is set to '/' instead of containing a full URL. This issue appears across multiple GA4 properties and leads to failures in the tests defined for fct_ga4__pages, specifically the uniqueness test on the combination of page_location and event_date_dt.

Here's the test that's failing:

# models/marts/core/fct_ga4__pages.yml
# ... 
tests:
  - unique:
      column_name: "(page_location || event_date_dt)"

The failure message is:

Failure in test unique_fct_ga4__pages__page_location_event_date_dt_ (models/marts/core/fct_ga4__pages.yml)

To resolve this issue, I'm considering a workaround where records with page_location = '/' are removed at an early stage in the transformation process. However, I'm unsure of the best approach to implement this workaround without diverging significantly from the package's intended usage and maintaining upgradeability.

Could you provide guidance on how to best address malformed page_location values within the framework of the package? Is there a recommended approach for filtering out or correcting these records before they impact downstream models and tests?

Thank you for your assistance.

@adamribaudo-velir
Copy link
Collaborator

Seems to me like the issue is with the test - at the very least it should include stream_id. @dgitis ?

I'd like to hear more about why having a page_location set to / is a problem, @erikverheij . Can you expand on that, ignoring the fct_ga4__pages model for a minute?

@erikverheij
Copy link
Contributor Author

Hi @adamribaudo-velir, thanks for your quick reply.

Usually, the page_location contains a full domain name. I'm not sure how these entries entered our GA4 account.

@adamribaudo-velir
Copy link
Collaborator

@erikverheij oh ok. For some reason I thought page_location didn't include the hostname. It sounds like there's an issue with your data collection then, yes? Maybe that field is being overwritten with page path in GTM or something?

Regardless, the faulty data has already been collected so I think your best bet is to set the severity of the error you noticed to warn so that it doesn't block your dbt build https://docs.getdbt.com/reference/resource-configs/severity

@dgitis
Copy link
Collaborator

dgitis commented Mar 25, 2024

We had an issue with multi-site related to this that was recently fixed by @yamotech.

That test should be column_name: "(event_date_dt || stream_id || page_location)". I think that was before the last release so updating your packages.yml file to pull from Git will work until a new release comes out.

packages:
  - git: "https://github.com/Velir/dbt-ga4.git"

However, that would not cause the page_location to be a /.

This looks to me to be a data collection problem like maybe someone decided to override the page_location with the page path thus stripping the domain.

@erikverheij
Copy link
Contributor Author

Hi @dgitis, thx for the suggestion.

This looks to me to be a data collection problem like maybe someone decided to override the page_location with the page path thus stripping the domain.

Looks something like that indeed.. Is there a way to hook into the flow somewhere at the start or even before the start to remove these entries before transforming with this package?

@dgitis
Copy link
Collaborator

dgitis commented Mar 28, 2024

The failure is at fct_ga4__pages.

The package is designed for transformation that happen to all events get put in stg_ga4__events and event-specific transformations happen in the corresponding stg_ga4__event_* model.

What you should do is override the stg_ga4__events model and slip in a step that adds the hostname back in to page_location so that it doesn't error downstream.

To do this, you'll need to create a stg_ga4__events model in your project models folder (anywhere is fine, but I recommend using folders for organization), copy over the contents of the package stg_ga4__events file and modify the code to fix page_location before that field gets processed in other ways.

Then you'll need to disable the package version in your dbt_project.yml file's models section like this:

models:
  ga4:
    staging:
      stg_ga4__events:
        +enabled: false

That code was done from memory, so it may not be exactly correct.

@adamribaudo-velir
Copy link
Collaborator

Just an idea @erikverheij , but you could override the 'base select' macro to insert your own logic for parsing page_location rather than overriding the stg_ga4__events model https://github.com/Velir/dbt-ga4/blob/main/macros/base_select.sql

Only slightly cleaner, but thought I'd throw it out there.

@erikverheij
Copy link
Contributor Author

Thanks guys for providing some ways to workaround it! That will do the trick for me for now.

It would be very nice if there's a way to do some filtering before everything starts, without the need to override logic from the package. I'm not experienced in DBT, but perhaps something like this would be possible?:

Adding a placeholder macro with a WHERE clause in base_ga4__events that contains no logic in itself but can be overwritten to filter out some specific events.

Note, that this is only a suggestion for the nice-to-have list as the workaround works for me. Such a feature would make it easier to stay aligned with the logic from this repo.

@dgitis
Copy link
Collaborator

dgitis commented Apr 4, 2024

I've definitely thought about adding a custom SQL variable that, if set, adds a with custom_sql as ( {{ var('model_name_custom_sql') }}) block or, more likely some sort of {{ if macro_exists('model_name_custom_sql') }} logic, but I feel like both of those options are more complex than overriding the existing macros/models using dbt-native mechanisms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants