Replies: 4 comments
-
We could consider emitting some summary statistics as a separate data file (e.g. JSON). That way you could also compare two ETL results side by side, without having to rely on a single timeline (e.g. metrics evolving across nightly runs). Additionally, this could be checked in other places too (e.g. as part of the fast run when a PR is being merged). In general, having additional precomputed statistics of interest might be a very valuable extension to the overall effort of detecting changes with standard diffing techniques (cc @zaneselvans and @jdangerx).
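To make that concrete, here is a minimal sketch of emitting per-table summary statistics to JSON and diffing two runs. The statistics, table names, and file layout are hypothetical, not an existing design:

```python
# Hypothetical sketch: emit per-table summary statistics from an ETL run to JSON,
# so two runs (e.g. nightly builds) can be diffed without a shared timeline.
import json
from pathlib import Path

import pandas as pd


def summarize_table(df: pd.DataFrame) -> dict:
    """Compute a few cheap, diff-friendly statistics for one output table."""
    return {
        "row_count": len(df),
        "null_fraction": {col: round(float(v), 4) for col, v in df.isna().mean().items()},
        "numeric_mean": {
            col: round(float(v), 4) for col, v in df.select_dtypes("number").mean().items()
        },
    }


def write_run_summary(tables: dict[str, pd.DataFrame], out_path: Path) -> None:
    """Write one JSON summary file for an ETL run."""
    summary = {name: summarize_table(df) for name, df in tables.items()}
    out_path.write_text(json.dumps(summary, indent=2, sort_keys=True))


def diff_summaries(old_path: Path, new_path: Path) -> dict:
    """Report tables whose summary statistics differ between two runs."""
    old = json.loads(old_path.read_text())
    new = json.loads(new_path.read_text())
    return {
        table: {"old": old[table], "new": new[table]}
        for table in old.keys() & new.keys()
        if old[table] != new[table]
    }
```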
-
This seems like the kind of thing there must already be solutions for, but I suspect a lot of them are proprietary platforms whose business model is to get you locked in and then crank up prices to extract rents, at levels we won't be able to afford because we're not trying to extract rents from anyone else. Outputting some kind of ETL summary alongside, or even inside, the SQLite DB that we generate sounds like a good option. If we keep the nightly build outputs for the last 30 days, retain and distribute this summary with every tagged release, and do a tagged release once a month, then we'd have a lot of recent detail as well as a longer baseline for comparison. I could imagine setting up an action that runs once a day to load all of those historical summaries into a single SQLite DB in a cloud bucket or a BigQuery table, which we could use to visualize the evolution of the metrics over time or do point comparisons. It would be really nice if we could just make this work with Dagster's metadata integration and some off-the-shelf system; I don't think this is a system we can really afford to build and maintain ourselves.
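As a rough illustration of that daily aggregation step, here is a sketch of loading retained JSON summaries into a single SQLite database. The file naming, table schema, and column names are assumptions, nothing that exists today:

```python
# Hypothetical sketch: fold each retained run summary (one JSON file per build)
# into a single SQLite table for longitudinal queries and point comparisons.
import json
import sqlite3
from pathlib import Path


def load_summaries(summary_dir: Path, db_path: Path) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS run_metrics (
            run_date TEXT,
            table_name TEXT,
            metric TEXT,
            value REAL,
            PRIMARY KEY (run_date, table_name, metric)
        )
        """
    )
    for path in sorted(summary_dir.glob("*.json")):
        run_date = path.stem  # assumes files are named like 2023-06-01.json
        summary = json.loads(path.read_text())
        for table_name, metrics in summary.items():
            for metric, value in metrics.items():
                if isinstance(value, (int, float)):  # skip nested per-column stats
                    conn.execute(
                        "INSERT OR REPLACE INTO run_metrics VALUES (?, ?, ?, ?)",
                        (run_date, table_name, metric, float(value)),
                    )
    conn.commit()
    conn.close()
```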
-
A few thoughts:
-
I definitely agree that there's a lot of overlap between metrics tracking and validation tests, so creating some kind of integration between the two would be useful. I'm not aware of anything like this out of the box for mlflow, but it might be something that could be done with a small layer on top. I'm not sure exactly what that'd look like, but it would probably depend significantly on the specific validation framework and metrics tracking strategy we use. With mlflow, I think the only necessary cloud infrastructure would be a Postgres instance that we could log to, and I'm pretty sure everything else can be run locally. That could be a slight advantage over using Dagster's tooling, though if we're eventually going to move towards deploying Dagster anyway, then maybe that's not as big of a deal. This is a bit tangential, but I could also see mlflow becoming useful for storing and retrieving model weights/parameters. That may well become useful for some of the entity matching problems, but it also feels like a fairly high infrastructure burden.
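For concreteness, a rough sketch of what logging build metrics to an mlflow tracking backend on Postgres might look like; the tracking URI, experiment name, and metric names/values are placeholders, not an existing setup:

```python
import mlflow

# Tracking URI, experiment name, and metric names/values are all placeholders.
mlflow.set_tracking_uri("postgresql://mlflow_user:password@db-host:5432/mlflow")
mlflow.set_experiment("pudl-build-metrics")

with mlflow.start_run(run_name="nightly-build"):
    mlflow.log_param("git_ref", "dev")
    mlflow.log_metrics(
        {
            "ferc1_unique_plant_ids": 8214.0,  # made-up value for illustration
            "ferc1_unmatched_record_fraction": 0.03,  # made-up value for illustration
        }
    )
    # Model weights/parameters (e.g. for entity matching) could be attached as artifacts:
    # mlflow.log_artifact("plant_matching_model.pkl")
```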
-
Background
While working on replacing the FERC-FERC inter-year plant matching, I've been considering metrics that might be useful to track so we can see trends over time or significant changes between builds. The kind of metrics I'm thinking of are things that would be hard to write a good validation test for (i.e. we don't know what the "correct" value or range of values is), so tracking trends might be more informative. One such metric could be the number of unique FERC plant IDs assigned during matching. If this were to change significantly when we haven't integrated any new data or changed the matching process, we would probably want to be aware of that. I'm sure there are many other similar metrics, and it would be very useful to have a minimal framework of tests and infrastructure that makes it fast and easy to add new metrics to the tracking system and have them recorded and analyzed during every build.
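As an illustration only (not existing PUDL code), a tiny sketch of computing that metric and flagging a large change relative to the previous build; the column name and threshold are assumptions:

```python
import pandas as pd


def count_unique_ferc1_plant_ids(plants_ferc1: pd.DataFrame) -> int:
    """Number of distinct FERC plant IDs assigned by the inter-year matching."""
    return int(plants_ferc1["plant_id_ferc1"].nunique())


def changed_significantly(current: int, previous: int, rel_threshold: float = 0.05) -> bool:
    """Flag a relative change between builds larger than rel_threshold."""
    if previous == 0:
        return current != 0
    return abs(current - previous) / previous > rel_threshold
```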
Tools
I'm sure there are many tools we could potentially make use of, but a couple of approaches come to mind:
MLflow
I set up some basic experiment tracking using mlflow in the CCAI repo, and that was fairly straightforward. I think all we would need is a SQL connection where mlflow could write results. The advantage of mlflow is that it comes with nice tools for querying and visualizing metrics.
Dagster
We could also potentially try to use Dagster's metadata feature, but I feel like this would work better if we were using Dagster Cloud, and it also doesn't come with visualization tools.
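For reference, a minimal sketch of what using Dagster's asset metadata might look like; the asset name and example data are made up, and plotting per-run metadata as trends over time would still need Dagster Cloud or external tooling:

```python
import pandas as pd
from dagster import MetadataValue, Output, asset


@asset
def plants_ferc1_matched():
    # Stand-in for the real plant matching output; a tiny example frame.
    df = pd.DataFrame(
        {"plant_id_ferc1": [1, 1, 2, 3], "capacity_mw": [10.0, 10.0, 50.0, 5.0]}
    )
    # Metadata attached here shows up on each materialization in the Dagster UI.
    return Output(
        df,
        metadata={
            "row_count": MetadataValue.int(len(df)),
            "unique_plant_ids": MetadataValue.int(int(df["plant_id_ferc1"].nunique())),
        },
    )
```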