Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joins conditions not working with backfill data #14

Open
PowdyPowPow opened this issue Sep 5, 2018 · 0 comments
Open

Joins conditions not working with backfill data #14

PowdyPowPow opened this issue Sep 5, 2018 · 0 comments

Comments

@PowdyPowPow
Copy link

I think the source of this problem is the way Bigquery Transfer Service handles backfills - ie when doing backfills Match tables are not updated. Source

For example we started using the transfers service on the 20th August and ALL the tables started syncing. Since then, we have a row (snapshot) for every single table for every single day. That is the _DATA_DATE field based on _PARTITIONTIME.

Since we wanted to get historic data we initiated a backfill and it populated all BUT the Match tables with a row for each day in the past. So for example the MIN(_DATA_DATE) in the campaigns table is still 20.8.2018, whereas in the campaign_stats table that MIN(_DATA_DATE) is currently 1.3.2018.

Now in the Looker block in the explore all the joins between the stats and Match tables happen on the id AND _DATA_DATE like this: sql_on: ${master_stats.campaign_id} = ${campaign.campaign_id} AND ${master_stats._data_raw} = ${campaign._data_raw}

But for all the historic (backfill) data that join will never work since the Match tables are not updated during backfill.

My understanding would be that this particular join condition is there to make sure you use the values of the Match table that were valid at the time the stats were taken. Which is perfectly fine and it works as well if you have the data for match tables for the same period (in our case that is everything after 20.8.2018. But since we don't have the data going that far back we are getting null values when joining with campaign.

A possible fix is to change the condition to use the latest campaign value like so: sql_on: ${master_stats.campaign_id} = ${campaign.campaign_id} AND ${campaign._data_raw} = ${campaign._latest}
This would mean you are always looking at the latest values of the campaign but at least you would be getting some values. I guess another option would be to try to figure out if the join can happen before the data exists and if it doesn't, use the latest version (or first for that matter).

It this something you have considered, does that make sense or am i missing something and there is a reason why it was done that way?

Thanks,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant