-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a sample script to validate Metric Hub definitions #527
base: main
Are you sure you want to change the base?
Conversation
I used a script similar to this to validate definition changes in #483 and #523 after merging. @jmsilverman suggested it could be helpful to check the script into the repo. Open to a discussion about whether or not this is useful and, if not, if there's something we can do instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting a couple small code changes via inline comments.
As Brad mentioned, I think this script is useful to have in the repo and I ran it in Colab and it worked great!
I'm not totally sure if it belongs in the README vs. as a standalone script/file, so I'm curious to hear other people's opinions on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great idea, just a few suggestions
# Modify the metric source table string so that it formats nicely in the query. | ||
from_expression = metric.data_source._from_expr.replace("\n", "\n" + " " * 15) | ||
|
||
query = dedent( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding the query builder as a method inside of DataSource
, something like DataSource.build_validation_query(start_date: str, end_date: str) -> str
. Then we can expose that through the Metric
through something like:
# in Metric class
def build_validation_query(self, start_date: str, end_date: str) -> str:
return self.data_source.build_validation_query(start_date, end_date)
In the DataSource
method, we can handle modifying the from expression like how build_query
does it. Additionally, that gives us a separate method to unit test. Right now, the logic is hidden within this larger function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that, this function can be simplified to:
metric = ConfigLoader.get_metric(metric_slug=metric_slug, app_name=app_name)
query = metric.build_validation_query(start_date, end_date)
df = bigquery.Client(project=bq_project).query(query).to_dataframe()
df[submission_date_column] = pd.to_datetime(df[submission_date_column]).dt.date
return df
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really, really like the idea of having an attribute on Metric that provides a query. I don't know if it's difficult to do in the general case, but it's something I think could be useful outside of this case. For example, there's a class in the forecasting code whose main goal is to build a query from Metric Hub components.
{metric.select_expr} AS value | ||
FROM {from_expression} | ||
WHERE {submission_date_column} BETWEEN '{start_date}' AND '{end_date}' | ||
GROUP BY {submission_date_column} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most metric definitions assume that we're grouping by client_id
too, so we should add that to the GROUP BY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a programmatic way to figure out what other groupby dimensions might be?
Changed to WIP so it's clear this isn't meant to be merged right now. I also addressed some of the comments. I think that having a way to build queries from Metric Hub objects would be a useful addition to |
I used a script similar to this to validate definition changes in #483 and #523 after merging. @jmsilverman suggested it could be helpful to check the script into the repo.
Open to a discussion about whether or not this is useful and, if not, if there's something we can do instead.