Replies: 4 comments
-
We could consider emitting some summary statistics as a separate data file (e.g. JSON). That way you could also compare two ETL results side by side, without having to rely on a single timeline (e.g. metrics evolving across nightly runs). Additionally, this could be checked in other places too (e.g. as part of the fast run when a PR is being merged). In general, having additional precomputed statistics of interest might be a very valuable extension to the overall effort of detecting changes with standard diffing techniques (cc @zaneselvans and @jdangerx).
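To make that concrete, here is a minimal sketch of emitting per-table summary statistics to JSON and diffing two runs. The statistics, table names, and file layout are hypothetical, not an existing design:

```python
# Hypothetical sketch: emit per-table summary statistics from an ETL run to JSON,
# so two runs (e.g. nightly builds) can be diffed without a shared timeline.
import json
from pathlib import Path

import pandas as pd


def summarize_table(df: pd.DataFrame) -> dict:
    """Compute a few cheap, diff-friendly statistics for one output table."""
    return {
        "row_count": len(df),
        "null_fraction": {col: round(float(v), 4) for col, v in df.isna().mean().items()},
        "numeric_mean": {
            col: round(float(v), 4) for col, v in df.select_dtypes("number").mean().items()
        },
    }


def write_run_summary(tables: dict[str, pd.DataFrame], out_path: Path) -> None:
    """Write one JSON summary file for an ETL run."""
    summary = {name: summarize_table(df) for name, df in tables.items()}
    out_path.write_text(json.dumps(summary, indent=2, sort_keys=True))


def diff_summaries(old_path: Path, new_path: Path) -> dict:
    """Report tables whose summary statistics differ between two runs."""
    old = json.loads(old_path.read_text())
    new = json.loads(new_path.read_text())
    return {
        table: {"old": old[table], "new": new[table]}
        for table in old.keys() & new.keys()
        if old[table] != new[table]
    }
```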
-
This seems like the kind of thing there must already be solutions for, but I suspect a lot of them are proprietary platforms whose business model is to get you locked in and then crank up prices to extract rents, at levels we won't be able to afford because we're not trying to extract rents from anyone else. Outputting some kind of ETL summary alongside, or even inside, the SQLite DB that we generate sounds like a good option. If we keep the nightly build outputs for the last 30 days, retain and distribute this summary with every tagged release, and do a tagged release once a month, then we'd have a lot of recent detail as well as a longer baseline for comparison. I could imagine setting up an action that runs once a day to load all of those historical summaries into a single SQLite DB in a cloud bucket or a BigQuery table, which we could use to visualize the evolution of the metrics over time or do point comparisons. It would be really nice if we could just make this work with Dagster's metadata integration and some off-the-shelf system; I don't think this is a system we can really afford to build and maintain ourselves.
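As a rough illustration of that daily aggregation step, here is a sketch of loading retained JSON summaries into a single SQLite database. The file naming, table schema, and column names are assumptions, nothing that exists today:

```python
# Hypothetical sketch: fold each retained run summary (one JSON file per build)
# into a single SQLite table for longitudinal queries and point comparisons.
import json
import sqlite3
from pathlib import Path


def load_summaries(summary_dir: Path, db_path: Path) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS run_metrics (
            run_date TEXT,
            table_name TEXT,
            metric TEXT,
            value REAL,
            PRIMARY KEY (run_date, table_name, metric)
        )
        """
    )
    for path in sorted(summary_dir.glob("*.json")):
        run_date = path.stem  # assumes files are named like 2023-06-01.json
        summary = json.loads(path.read_text())
        for table_name, metrics in summary.items():
            for metric, value in metrics.items():
                if isinstance(value, (int, float)):  # skip nested per-column stats
                    conn.execute(
                        "INSERT OR REPLACE INTO run_metrics VALUES (?, ?, ?, ?)",
                        (run_date, table_name, metric, float(value)),
                    )
    conn.commit()
    conn.close()
```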
-
A few thoughts:
-
I definitely agree that there's a lot of overlap between metrics tracking and validation tests, so creating some kind of integration between the two would be useful. I'm not aware of anything like this out of the box for mlflow, but it might be something that could be done with a small layer on top. I'm not sure exactly what that'd look like, but it would probably depend significantly on the specific validation framework and metrics tracking strategy we use. With mlflow, I think the only necessary cloud infrastructure would be a Postgres instance that we could log to, and I'm pretty sure everything else can be run locally. That could be a slight advantage over using Dagster's tooling, though if we're eventually going to move towards deploying Dagster anyway, then maybe that's not as big of a deal. This is a bit tangential, but I could also see mlflow becoming useful for storing and retrieving model weights/parameters. That may well become useful for some of the entity matching problems, but it also feels like a fairly high infrastructure burden.
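For concreteness, a rough sketch of what logging build metrics to an mlflow tracking backend on Postgres might look like; the tracking URI, experiment name, and metric names/values are placeholders, not an existing setup:

```python
import mlflow

# Tracking URI, experiment name, and metric names/values are all placeholders.
mlflow.set_tracking_uri("postgresql://mlflow_user:password@db-host:5432/mlflow")
mlflow.set_experiment("pudl-build-metrics")

with mlflow.start_run(run_name="nightly-build"):
    mlflow.log_param("git_ref", "dev")
    mlflow.log_metrics(
        {
            "ferc1_unique_plant_ids": 8214.0,  # made-up value for illustration
            "ferc1_unmatched_record_fraction": 0.03,  # made-up value for illustration
        }
    )
    # Model weights/parameters (e.g. for entity matching) could be attached as artifacts:
    # mlflow.log_artifact("plant_matching_model.pkl")
```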
-
Background
While working on replacing the FERC-FERC inter-year plant matching, I've been considering metrics that might be useful to track so we can see trends over time or significant changes between builds. The kind of metrics I'm thinking of are things that would be hard to write a good validation test for (i.e. we don't know what the "correct" value or range of values is), so tracking trends might be more informative. One such metric could be the number of unique FERC plant IDs assigned during matching. If this were to change significantly when we haven't integrated any new data or changed the matching process, we would probably want to be aware of that. I'm sure there are many other similar metrics, and it would be very useful to have a minimal framework of tests and infrastructure that makes it fast and easy to add new metrics to the tracking system and have them recorded and analyzed during every build.
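As an illustration only (not existing PUDL code), a tiny sketch of computing that metric and flagging a large change relative to the previous build; the column name and threshold are assumptions:

```python
import pandas as pd


def count_unique_ferc1_plant_ids(plants_ferc1: pd.DataFrame) -> int:
    """Number of distinct FERC plant IDs assigned by the inter-year matching."""
    return int(plants_ferc1["plant_id_ferc1"].nunique())


def changed_significantly(current: int, previous: int, rel_threshold: float = 0.05) -> bool:
    """Flag a relative change between builds larger than rel_threshold."""
    if previous == 0:
        return current != 0
    return abs(current - previous) / previous > rel_threshold
```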
Tools
I'm sure there are many tools we could potentially make use of, but a couple of approaches come to mind:
MLflow
I set up some basic experiment tracking using mlflow in the CCAI repo, and that was fairly straightforward. I think all we would need is a SQL connection where mlflow could write results. The advantage of mlflow is that it comes with nice tools for querying and visualizing metrics.
Dagster
We could also potentially try to use Dagster's metadata feature, but I feel like this would work better if we were using Dagster Cloud, and it also doesn't come with visualization tools.
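For reference, a minimal sketch of what using Dagster's asset metadata might look like; the asset name and example data are made up, and plotting per-run metadata as trends over time would still need Dagster Cloud or external tooling:

```python
import pandas as pd
from dagster import MetadataValue, Output, asset


@asset
def plants_ferc1_matched():
    # Stand-in for the real plant matching output; a tiny example frame.
    df = pd.DataFrame(
        {"plant_id_ferc1": [1, 1, 2, 3], "capacity_mw": [10.0, 10.0, 50.0, 5.0]}
    )
    # Metadata attached here shows up on each materialization in the Dagster UI.
    return Output(
        df,
        metadata={
            "row_count": MetadataValue.int(len(df)),
            "unique_plant_ids": MetadataValue.int(int(df["plant_id_ferc1"].nunique())),
        },
    )
```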