Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a sample script to validate Metric Hub definitions #527

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

bochocki
Copy link
Contributor

I used a script similar to this to validate definition changes in #483 and #523 after merging. @jmsilverman suggested it could be helpful to check the script into the repo.

Open to a discussion about whether or not this is useful and, if not, if there's something we can do instead.

I used a script similar to this to validate definition changes in #483 and #523 after merging. @jmsilverman suggested it could be helpful to check the script into the repo.

Open to a discussion about whether or not this is useful and, if not, if there's something we can do instead.
Copy link

@jmsilverman jmsilverman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting a couple small code changes via inline comments.

As Brad mentioned, I think this script is useful to have in the repo and I ran it in Colab and it worked great!

I'm not totally sure if it belongs in the README vs. as a standalone script/file, so I'm curious to hear other people's opinions on that.

definitions/README.md Show resolved Hide resolved
definitions/README.md Show resolved Hide resolved
Copy link
Contributor

@danielkberry danielkberry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great idea, just a few suggestions

# Modify the metric source table string so that it formats nicely in the query.
from_expression = metric.data_source._from_expr.replace("\n", "\n" + " " * 15)

query = dedent(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding the query builder as a method inside of DataSource, something like DataSource.build_validation_query(start_date: str, end_date: str) -> str. Then we can expose that through the Metric through something like:

# in Metric class
def build_validation_query(self, start_date: str, end_date: str) -> str:
    return self.data_source.build_validation_query(start_date, end_date)

In the DataSource method, we can handle modifying the from expression like how build_query does it. Additionally, that gives us a separate method to unit test. Right now, the logic is hidden within this larger function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that, this function can be simplified to:

metric = ConfigLoader.get_metric(metric_slug=metric_slug, app_name=app_name)
query = metric.build_validation_query(start_date, end_date)
df = bigquery.Client(project=bq_project).query(query).to_dataframe()
df[submission_date_column] = pd.to_datetime(df[submission_date_column]).dt.date
return df

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really, really like the idea of having an attribute on Metric that provides a query. I don't know if it's difficult to do in the general case, but it's something I think could be useful outside of this case. For example, there's a class in the forecasting code whose main goal is to build a query from Metric Hub components.

definitions/README.md Show resolved Hide resolved
definitions/README.md Show resolved Hide resolved
{metric.select_expr} AS value
FROM {from_expression}
WHERE {submission_date_column} BETWEEN '{start_date}' AND '{end_date}'
GROUP BY {submission_date_column}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most metric definitions assume that we're grouping by client_id too, so we should add that to the GROUP BY

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a programmatic way to figure out what other groupby dimensions might be?

definitions/README.md Outdated Show resolved Hide resolved
@bochocki bochocki marked this pull request as draft July 22, 2024 19:39
@bochocki
Copy link
Contributor Author

Changed to WIP so it's clear this isn't meant to be merged right now. I also addressed some of the comments.

I think that having a way to build queries from Metric Hub objects would be a useful addition to mozanalysis and would probably be more valuable than documenting a script like this somewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants