Joining the modern data stack with the modern ML stack
As part of our TDS series on MLOps, our blog post shows how a post-modern stack works, by deconstructing (see the pun?) our original YDNABB repo into the few fundamental pieces owning the actual compute: a data warehouse for dataOps, and Metaflow on AWS for MLOps. A quick, high-level walk-through of the stack can be found in our intro video:
As a use case, we pick a popular RecSys challenge, session-based recommendation: given the interactions between a shopper and some products in a browsing session, can we train a model to predict what the next interaction will be? The flow is powered by our open-source Coveo Data Challenge dataset - as a model, we train a vanilla LSTM, a model just complex enough to make good use of cloud computing. At a quick glance, this is what we are building:
As usual, we show a working, end-to-end, real-world flow: while you can run it locally with a few thousand sessions to get the basics, we suggest using the `MAX_SESSIONS` variable to appreciate how well the stack scales - with no code changes - as millions of events are pushed to the warehouse.
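To make the session-based task concrete, here is a minimal sketch of how one browsing session could be unrolled into (prefix, next item) training pairs. The function name and the SKU values are hypothetical, and the actual preprocessing in the flow may differ.

```python
from typing import List, Tuple


def next_item_pairs(session: List[str], min_prefix: int = 1) -> List[Tuple[List[str], str]]:
    """Split one session into (prefix, next item) pairs for next-interaction prediction."""
    return [
        (session[:i], session[i])  # everything seen so far, then the item to predict
        for i in range(min_prefix, len(session))
    ]


# Hypothetical session: product SKUs in the order the shopper interacted with them.
session = ["sku_1", "sku_7", "sku_3"]
pairs = next_item_pairs(session)
print(pairs)
```

A session with a single interaction yields no pairs, since there is nothing left to predict.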
For an in-depth explanation of the philosophy behind the approach, please check the companion blog post, and the previous episodes / repos in the series.
The code is a self-contained recommender project; however, since we leverage best-in-class tools, some preliminary setup is required. Please make sure the requirements are satisfied, depending on what you wish to run - roughly in order of ascending complexity:
The basics: Metaflow, Snowflake and dbt
A Snowflake account is needed to host the data, and a working Metaflow + dbt setup is needed to run the flow; we strongly suggest running Metaflow on AWS (as it is the intended setup), but with some minor modifications you should be able to run the flow with a local store as well.
- Snowflake account: sign up for a free trial.
- AWS account: sign up for a free AWS account.
- Metaflow on AWS: follow the setup guide.
- dbt core setup: on top of installing the open source package (already included in `requirements.txt`), you need to point dbt to your Snowflake instance with the proper dbt_profile. Make sure the SCHEMA there matches what is specified in the `.env` file (`SF_SCHEMA`).
Adding experiment tracking
- Comet ML: sign up for free and get an API key. If you don't want experiment tracking, make sure to comment out the Comet-specific parts in the `train_model` step.
Adding PaaS deployment
- SageMaker setup: to deploy the model as a PaaS solution using SageMaker, the `IAM_SAGEMAKER_ROLE` parameter in the flow needs to contain a suitable IAM role to deploy an endpoint and access the S3 bucket where Metaflow is storing the model artifact; if you don't wish to deploy your model, run the flow with `SAGEMAKER_DEPLOY=0` in the `.env` file.
Adding dbt cloud
- dbt cloud account: sign up for free and get an API key. If you don't wish to use dbt cloud but just the local setup, set `DBT_CLOUD=0` in the `.env` file.
Set up a virtual environment with the project dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
NOTE: the current version of RecList has some old dependencies which may result in some (harmless) pip conflicts - conflicts will disappear with the new version, coming out soon.
Create a local version of the `local.env` file named only `.env` (do not commit it!), and make sure to fill its values properly:
VARIABLE | TYPE | MEANING |
---|---|---|
SF_USER | string | Snowflake user name |
SF_PWD | string | Snowflake password |
SF_ACCOUNT | string | Snowflake account |
SF_DB | string | Snowflake database |
SF_SCHEMA | string (suggested: POST_MODERN_DATA_STACK) | Snowflake schema for raw and transformed data |
SF_TABLE | string (COVEO_DATASET_RAW) | Snowflake table for raw data |
SF_ROLE | string | Snowflake role to run SQL |
APPLICATION_API_KEY | uuid (474d1224-e231-42ed-9fc9-058c2a8347a5) | Organization id to simulate a SaaS company |
MAX_SESSIONS | int (1000) | Number of raw sessions to load into Snowflake (try first running the project locally with a small number) |
EN_BATCH | 0-1 (0) | Enable/disable cloud computing for @batch steps in Metaflow (try first running the project locally) |
COMET_API_KEY | string | Comet ML api key |
DBT_CLOUD | 0-1 (0) | Enable/disable running dbt on the cloud |
SAGEMAKER_DEPLOY | 0-1 (1) | Enable/disable deploying the model artifact to a SageMaker endpoint |
DBT_ACCOUNT_ID | int | dbt cloud account id (you can find it in the dbt cloud URL) |
DBT_PROJECT_ID | int | dbt cloud project id (you can find it in the dbt cloud URL) |
DBT_JOB_ID | int | dbt cloud job id (you can find it in the dbt cloud URL) |
DBT_API_KEY | string | dbt cloud api key |
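
Before running anything, it can help to sanity-check the `.env` file against the table above. The following is a minimal, standard-library-only sketch: the parser, the choice of which variables count as required, and the sample values are all assumptions for illustration, not how the project itself loads its configuration.

```python
from typing import Dict, List

# Assumption: the Snowflake settings are always needed, the dbt cloud ones only when DBT_CLOUD=1.
REQUIRED = ["SF_USER", "SF_PWD", "SF_ACCOUNT", "SF_DB", "SF_SCHEMA", "SF_TABLE", "SF_ROLE"]
FLAGS = ["EN_BATCH", "DBT_CLOUD", "SAGEMAKER_DEPLOY"]
DBT_CLOUD_KEYS = ["DBT_ACCOUNT_ID", "DBT_PROJECT_ID", "DBT_JOB_ID", "DBT_API_KEY"]


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_env(values: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems (empty if everything looks fine)."""
    problems = [f"missing {key}" for key in REQUIRED if not values.get(key)]
    for flag in FLAGS:
        if flag in values and values[flag] not in ("0", "1"):
            problems.append(f"{flag} must be 0 or 1")
    if "MAX_SESSIONS" in values and not values["MAX_SESSIONS"].isdigit():
        problems.append("MAX_SESSIONS must be a non-negative integer")
    if values.get("DBT_CLOUD") == "1":
        problems += [f"missing {key}" for key in DBT_CLOUD_KEYS if not values.get(key)]
    return problems


# Hypothetical file content with two mistakes: no SF_ROLE, and an invalid flag value.
sample = """
# Snowflake
SF_USER=jane
SF_PWD=not-a-real-password
SF_ACCOUNT=my_account
SF_DB=MY_DB
SF_SCHEMA=POST_MODERN_DATA_STACK
SF_TABLE=COVEO_DATASET_RAW
MAX_SESSIONS=1000
EN_BATCH=0
DBT_CLOUD=2
"""
problems = check_env(parse_env(sample))
print(problems)
```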
Original datasets are from the Coveo SIGIR Data Challenge. To save you from downloading the original data dump and dealing with large text files, we re-used the abstraction over the data provided by RecList. If you run `upload_to_snowflake.py` in the `upload` folder from your laptop as a one-off script, the program will download the Data Challenge dataset and dump it to a Snowflake table that simulates the append-only log pattern. This allows us to use dbt and Metaflow to run realistic ELT and ML code over real-world data.
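As an illustration of the append-only log pattern, the sketch below turns raw events into rows that carry the organization id, an ingestion timestamp and the untouched JSON payload, while capping the number of sessions as `MAX_SESSIONS` does. The row layout and the event field names (`session_id`, `product`, `action`) are assumptions made for this example; the actual table written by the upload script may look different.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Set, Tuple


def to_log_rows(events: Iterable[Dict], api_key: str, max_sessions: int) -> List[Tuple[str, str, str]]:
    """Build (api_key, ingestion_timestamp, raw_json) rows, keeping only the first max_sessions sessions."""
    rows: List[Tuple[str, str, str]] = []
    seen_sessions: Set[str] = set()
    ingested_at = datetime.now(timezone.utc).isoformat()
    for event in events:
        session_id = event["session_id"]
        if session_id not in seen_sessions:
            if len(seen_sessions) >= max_sessions:
                continue  # session cap reached: ignore events from new sessions
            seen_sessions.add(session_id)
        # The payload is stored as-is: parsing and cleaning are left to dbt.
        rows.append((api_key, ingested_at, json.dumps(event, sort_keys=True)))
    return rows


# Hypothetical events from three sessions, with a cap of two sessions.
events = [
    {"session_id": "s1", "product": "sku_1", "action": "detail"},
    {"session_id": "s2", "product": "sku_9", "action": "detail"},
    {"session_id": "s1", "product": "sku_7", "action": "add"},
    {"session_id": "s3", "product": "sku_2", "action": "detail"},
]
rows = to_log_rows(events, api_key="474d1224-e231-42ed-9fc9-058c2a8347a5", max_sessions=2)
print(len(rows))
```

Keeping the raw payload untouched is what lets the warehouse act as the single source of truth: every downstream view can be rebuilt from the log.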
Once you run the script, check your Snowflake for the new schema/table:
If you wish to see how a data ingestion pipeline works (i.e. an endpoint streaming individual events into Snowflake in real time, instead of a bulk upload), we open-sourced a serverless pipeline as well.
While we will run dbt code as part of Metaflow, it is good practice to try and see if everything works from a stand-alone setup first. To run and test the dbt transformations, just `cd` into the `dbt` folder and run `dbt run --vars '{SF_SCHEMA: POST_MODERN_DATA_STACK, SF_TABLE: COVEO_DATASET_RAW}'`, where the variables reflect the content of your `.env` file (you can also run `dbt test`, if you like).
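If you prefer to derive the dbt variables from your configuration instead of typing them by hand, a small helper along these lines could build the command. This is a sketch under assumptions: `dbt_run_command` is a hypothetical name, and it relies on dbt accepting a JSON object for `--vars` (JSON being valid YAML), which is worth double-checking against your dbt version.

```python
import json
import shlex
from typing import Dict, List


def dbt_run_command(env: Dict[str, str]) -> List[str]:
    """Build the argument list for `dbt run`, passing schema and table as dbt variables."""
    dbt_vars = {"SF_SCHEMA": env["SF_SCHEMA"], "SF_TABLE": env["SF_TABLE"]}
    return ["dbt", "run", "--vars", json.dumps(dbt_vars)]


cmd = dbt_run_command({"SF_SCHEMA": "POST_MODERN_DATA_STACK", "SF_TABLE": "COVEO_DATASET_RAW"})
print(shlex.join(cmd))  # the equivalent, shell-quoted command line
# To execute it from the project root: subprocess.run(cmd, cwd="dbt", check=True)
```

Passing the arguments as a list avoids shell quoting issues around the braces and quotes in the variables.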
Once you run dbt, check your Snowflake for the views:
The `DBT_CLOUD` variable (see above) controls whether transformations are run from within the flow folder, or from a dbt cloud account, by using the dbt API to trigger the transformation on the cloud platform. If you want to leverage dbt cloud, make sure to manually create a job on the platform, and then configure the relevant variables in the `.env` file. In our tests, we used the exact same `.sql` and `.yml` files that you find in this repository:
Please note that instead of having a local dbt folder, you could have your dbt code in a GitHub repo and then either clone it using GitHub APIs at runtime, or import it in dbt cloud and use the platform to run the code base.
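For reference, triggering a dbt cloud job over HTTP can be sketched as below. The URL shape, the `Token` authorization header and the `cause` field follow the dbt Cloud v2 "trigger job run" endpoint as we understand it; treat them as assumptions to verify against the official API documentation (the host also differs for some accounts). The function name and all ids are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_trigger_request(account_id: int, job_id: int, api_key: str, cause: str) -> urllib.request.Request:
    """Prepare (without sending) a POST request that asks dbt cloud to run a job."""
    url = f"https://cloud.getdbt.com/api/v2/accounts/{account_id}/jobs/{job_id}/run/"
    body = json.dumps({"cause": cause}).encode("utf-8")
    headers = {
        "Authorization": f"Token {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical ids and key: in the project these would come from DBT_ACCOUNT_ID, DBT_JOB_ID and DBT_API_KEY.
request = build_trigger_request(1234, 5678, "not-a-real-key", cause="Metaflow run 42 by jane")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Putting the flow id and user in the `cause` field is one way to make the run easy to recognize later in the dbt cloud history.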
Once the above setup steps are completed, you can run the flow:
- `cd` into the `src` folder;
- run the flow with `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py --package-suffixes ".py" run --max-workers 4`, where `METAFLOW_PROFILE` is needed to select a specific Metaflow config (you can omit it if you're using the default), `AWS_PROFILE` is needed to select a specific AWS config that runs the flow and its related AWS infrastructure (you can omit it if you're using the default), and `AWS_DEFAULT_REGION` is needed to specify the target AWS region (you can omit it if it is already specified in your local AWS profile and you do not wish to change it);
- visualize the performance card with `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view test_model --id recCard` (see below for an intro to RecList).
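
Since the three environment variables are optional, a launcher script could add them only when a non-default value is requested. The sketch below builds the environment overrides and the command shown above; `flow_invocation` is a hypothetical helper, not part of the repository.

```python
import os
from typing import Dict, List, Optional, Tuple


def flow_invocation(
    metaflow_profile: Optional[str] = None,
    aws_profile: Optional[str] = None,
    aws_region: Optional[str] = None,
    max_workers: int = 4,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (environment overrides, command) for running the flow; unset options are omitted."""
    candidates = {
        "METAFLOW_PROFILE": metaflow_profile,
        "AWS_PROFILE": aws_profile,
        "AWS_DEFAULT_REGION": aws_region,
    }
    overrides = {name: value for name, value in candidates.items() if value is not None}
    command = [
        "python", "my_dbt_flow.py",
        "--package-suffixes", ".py",
        "run", "--max-workers", str(max_workers),
    ]
    return overrides, command


overrides, command = flow_invocation(aws_region="us-west-2")
print(overrides, command)
# To launch from the project root:
# subprocess.run(command, cwd="src", env={**os.environ, **overrides}, check=True)
```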
If you run the fully-featured flow (i.e. `SAGEMAKER_DEPLOY=1`) with the recommended setup, you will end up with:
- an up-to-date view in Snowflake, leveraging dbt to make raw data ready for machine learning;
- versioned datasets and model artifacts in your AWS, accessible through the standard Metaflow client API;
- a Comet dashboard for experiment tracking of the deep learning model, displaying training stats;
- a versioned Metaflow card containing (some of) the tests run with RecList (see below);
- finally, a DL-based, sequential recommender system serving predictions in real-time using SageMaker for inference.
If you log in to your AWS SageMaker interface, you should find the new endpoint for next event prediction available for inference:
If you run the flow with dbt cloud, you will also find the dbt run in the history section on the cloud platform, easily identifiable through the flow id and user.
The project includes a (stub of a) custom DAG card showing how the model is performing according to RecList, our open-source framework for behavioral testing. We could devote an article / paper just to this (as we actually did recently!); you can visualize it with `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view test_model --id recCard` at the end of your run. No matter how small, we wanted to include the card/test as a reminder of how important it is to understand model behavior before deployment. Cards are a natural UI to display some of the RecList information: since readable, shareable (self-)documentation is crucial for production, RecList's new major release will include out-of-the-box support for visualization and reporting tools: reach out if you're interested!
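To give a flavor of the kind of number such a card can report, here is a plain-Python sketch of hit rate at k for next-item predictions. It is not RecList's API (RecList covers much more than aggregate metrics); the function name and toy data are ours.

```python
from typing import Sequence


def hit_rate_at_k(predictions: Sequence[Sequence[str]], targets: Sequence[str], k: int = 10) -> float:
    """Fraction of test cases whose true next item appears among the top-k predictions."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(predictions, targets) if target in ranked[:k])
    return hits / len(targets)


# Toy example: three test sessions, each with a ranked list of predicted products.
predictions = [["a", "b"], ["c", "d"], ["e", "f"]]
targets = ["b", "x", "e"]
print(hit_rate_at_k(predictions, targets, k=1))
print(hit_rate_at_k(predictions, targets, k=2))
```

With k=1 only the third case is a hit; with k=2 the first one becomes a hit as well.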
As a bonus bonus feature (thanks Valay for the snippet!), only when running with the dbt core setup, the (not-production-ready) function `get_dag_from_manifest` will read the local manifest file and produce a dictionary compatible with the Metaflow Card API. If you type `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view run_transformation --id dbtCard` at the end of a successful run, you should see a card displaying the dbt DAG as a Metaflow card, as in the image below:
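The general idea of reading a DAG out of the dbt manifest can be sketched as follows. This is not the repository's `get_dag_from_manifest`: it only extracts a plain mapping from each node to its parents, using the `nodes` and `depends_on` keys of `manifest.json`, and does not reproduce the dictionary layout the card expects. The model names are made up.

```python
import json
from typing import Dict, List


def dag_from_manifest(manifest: dict) -> Dict[str, List[str]]:
    """Map every dbt node id to the (sorted) list of node ids it depends on."""
    nodes = manifest.get("nodes", {})
    return {
        node_id: sorted(node.get("depends_on", {}).get("nodes", []))
        for node_id, node in nodes.items()
    }


# Hypothetical, heavily trimmed manifest; a real one is loaded with
# json.load(open("target/manifest.json")) after a dbt run.
manifest = {
    "nodes": {
        "model.demo.events_parsed": {"depends_on": {"nodes": ["source.demo.raw.events"]}},
        "model.demo.sessions": {"depends_on": {"nodes": ["model.demo.events_parsed"]}},
    }
}
dag = dag_from_manifest(manifest)
print(json.dumps(dag, indent=2))
```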
We leave to the reader (and / or to future iterations) to explore how to combine dbt, RecList and other info into a custom, well-designed card!
Of course, the post-modern stack can be further expanded or improved in many ways. Without presumption of completeness, these are some ideas to start:
- on the dataOps side, we could include some data quality checks, either by improving our dbt setup, or by introducing additional tooling: at reasonable scale the greater marginal value is typically to be found in better data, as compared to better models;
- on the MLOps side, we barely scratched the surface: on one side, we kept the modeling simple and avoided any tuning, which is however very easy to do using Metaflow's built-in parallelization abilities; on the other, you may decide to complicate the flow with other tools, improve on serving etc. (e.g. the proposal here). Swapping different tools with similar functionalities in and out should be easy: in a previous work, we abstracted away experiment tracking and allowed users to pick Neptune as an alternative SaaS platform. Similar considerations apply to this use case as well;
- a proper RecList for this flow is yet to be developed, as the current proposal is nothing more than a stub showing how easy it is to run a dedicated test suite when needed: you can augment the simple suite we prepared, improve the visualization on cards or both - since the RecList roadmap is quickly progressing, we expect a deeper integration and a whole new set of functionalities to be announced soon. Stay tuned for our next iteration on this!
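
On the tuning point, Metaflow can fan out over a list of configurations with `foreach`, so the first ingredient is simply the list itself. A minimal sketch, with hypothetical hyperparameter names and values (the flow in this repo does no tuning):

```python
from itertools import product
from typing import Any, Dict, List, Sequence


def expand_grid(grid: Dict[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Expand a dict of candidate values into one dict per combination."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[name] for name in names))]


# Hypothetical search space for the LSTM.
grid = {"learning_rate": [0.001, 0.01], "hidden_units": [64, 128]}
configs = expand_grid(grid)
print(configs)
# In a flow, a step could store this list as self.configs and then call
# self.next(self.train_model, foreach="configs") to train one model per entry.
```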
Is this the only way to run dbt in Metaflow? Of course not - in particular, you could think of writing a small wrapper around a flow and a dbt-core project that creates individual Metaflow steps corresponding to individual dbt steps, much as suggested here for another orchestrator. But this is surely a story for another repo / time ;-)
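As a starting point for such a wrapper, the dbt dependency graph would have to be turned into an execution order, one step per model. Below is a sketch using the standard library's `graphlib` (Python 3.9+); the node names are hypothetical and generating the actual Metaflow steps is left out.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def dbt_step_order(depends_on: Dict[str, List[str]]) -> List[str]:
    """Order nodes so that each one comes after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependencies: each key lists the nodes it needs as input.
depends_on = {
    "model.sessions_filtered": ["model.events_parsed"],
    "model.events_parsed": ["source.raw_events"],
    "model.training_view": ["model.sessions_filtered"],
}
order = dbt_step_order(depends_on)
print(order)
```

`TopologicalSorter` raises a `CycleError` on circular dependencies, which is a reasonable guard before generating any steps.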
Special thanks to Sung Won Chung from dbt Labs, Hugo Bowne-Anderson, Gaurav Bhushan, Savin Goyal, Valay Dave from Outerbounds, Luca Bigon, Andrea Polonioli and Ciro Greco from Coveo.
If you liked this project and the related article, please take a second to add a star to this and our RecList repository!
Contributors:
- Jacopo Tagliabue, general design, Metaflow fan boy, prototype.
- Patrick John Chia, model, deployment and testing.
All the code in this repo is freely available under an MIT License, also included in the project.