- Define problem
- Prepare data
- Train and evaluate
- Deploy and integrate
- Observe
- Experiment
- Retrain
- Clone the repository:

  ```bash
  git clone https://github.com/mage-ai/machine_learning.git
  ```

  Stay in the directory where you ran this command; don't change directories.
- Run Docker:

  ```bash
  docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai /app/run_app.sh mage start machine_learning
  ```

  If you aren't using macOS or Linux, check out the other examples in Mage's quick start guide.
- Open a browser and go to http://localhost:6789.

Clearly state the business problem you're trying to solve with machine learning and your hypothesis for how it can be solved.
- Open the `define_problem` pipeline.
- Define the problem and your hypothesis.
Collect data from various sources, generate additional training data if needed, and perform feature engineering to transform the raw data into a set of useful input features.
- The `core_data_users_v0` pipeline joins 3 tables together.
- The `prepare_data` pipeline is used in multiple other pipelines to perform data preparation on input datasets. For example, the `ml_training` pipeline, which is responsible for training an ML model, first runs the above 2 pipelines to build the training set used to train and test the model.
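To make the join concrete, here is a hypothetical sketch with pandas; the table names and columns are invented stand-ins for the 3 tables that `core_data_users_v0` actually joins:

```python
import pandas as pd

# Hypothetical stand-ins for the 3 source tables; the real pipeline
# loads these from its configured data sources.
users = pd.DataFrame({"user_id": [1, 2], "plan": ["free", "pro"]})
emails = pd.DataFrame({"user_id": [1, 2], "emails_received": [10, 3]})
activity = pd.DataFrame({"user_id": [1, 2], "logins_last_30d": [4, 20]})

# Join the tables on the shared user_id key to get one row per user.
core_data_users = users.merge(emails, on="user_id").merge(activity, on="user_id")
```

Each downstream pipeline can then treat `core_data_users` as a single wide table of per-user features.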
Use the training data to teach the machine learning model to make accurate predictions. Evaluate the trained model's performance on a test set.
- The `ml_training` pipeline takes in a training set and trains an XGBoost classifier to predict in which scenarios a user would unsubscribe from a marketing email.
- This pipeline also evaluates the model's performance on a test data set, provides visualizations, and explains which features are important using SHAP values.
- Finally, this pipeline serializes the model and its weights to disk so they can be used during the inference phase.
Deploy the trained model to a production environment to generate predictions on new data, either in real-time via an API or in batch pipelines. Integrate the model's predictions with other business applications.
- Once the model has finished training and has been packaged for deployment, before we can use it to make predictions we need to set up a feature store that serves user features on demand at prediction time.
- Use the `ml_feature_fetching` pipeline to prepare the features for each user ahead of time, before progressing to the inference phase.
- The `ml_inference_offline` pipeline is responsible for making batch predictions offline on the entire set of users.
- The `ml_inference_online` pipeline serves real-time model predictions and can be interacted with via an API request. Use the ML playground to interact with this model and make online predictions.
- The pipeline used for online inference is called `ml_inference_online`.
- Before interacting with the online predictions pipeline, you must first create an API trigger for the `ml_inference_online` pipeline. You can follow the general instructions to create an API trigger.
- The `ml_playground` pipeline contains no-code UI interactions that make it easy to play around with the online predictions.
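Once the API trigger exists, the online pipeline can be invoked with a plain HTTP request. The sketch below builds such a request with the Python standard library; the URL shape, the trigger id and token, and the variable names are all illustrative assumptions (Mage displays the real URL when you create the trigger):

```python
import json
import urllib.request

# Hypothetical endpoint: creating an API trigger for ml_inference_online
# yields a unique URL of roughly this shape (id and token here are made up).
url = "http://localhost:6789/api/pipeline_schedules/1/pipeline_runs/abc123"

# Runtime variables passed to the pipeline run; the names are illustrative.
payload = {"pipeline_run": {"variables": {"user_id": 42}}}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would submit the run once the trigger
# exists and the Mage server is reachable.
```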
Monitor the deployed model's prediction performance, latency, and system health in the production environment.
Example coming soon.
Conduct controlled experiments like A/B tests to measure the impact of the model's predictions on business metrics. Compare the new model's performance to a control model or previous model versions.
Example coming soon.
Continuously gather new training data and retrain the model periodically to maintain and improve prediction performance.
- Every 2 hours, the retraining pipeline named `ml_retraining_model` runs.
- The retraining pipeline triggers the `ml_training` pipeline if the following contrived condition is met: the number of partitions created for the `core_data.users_v0` data product is divisible by 4.
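The contrived trigger condition can be sketched as a simple predicate; how the partition count of `core_data.users_v0` is obtained is left as a hypothetical input here:

```python
def should_retrain(partition_count: int) -> bool:
    """Contrived condition: retrain only when the number of partitions
    created for core_data.users_v0 is non-zero and divisible by 4."""
    return partition_count > 0 and partition_count % 4 == 0

# e.g. 8 partitions -> retrain; 6 partitions -> skip
```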