
mage-ai/machine_learning

The definitive end-to-end machine learning (ML lifecycle) guide and tutorial for data engineers.

TLDR

  1. Define problem
  2. Prepare data
  3. Train and evaluate
  4. Deploy and integrate
  5. Observe
  6. Experiment
  7. Retrain


Setup

  1. Clone the repository: git clone https://github.com/mage-ai/machine_learning.git.

    1. Stay in the directory where you ran this command; don’t change directories.
  2. Run Docker:

    docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai /app/run_app.sh mage start machine_learning

    If you’re not on macOS or Linux, check out the other examples in Mage’s quick start guide.

  3. Open a browser and go to http://localhost:6789.


🕵️‍♀️ Define problem

Clearly state the business problem you're trying to solve with machine learning and your hypothesis for how it can be solved.

  1. Open pipeline define_problem.

  2. Define the problem and your hypothesis (a brief example is sketched below the demo).

Demo video: CleanShot.2024-04-14.at.08.42.48.mp4
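As an illustration only (borrowing the use case this tutorial trains a model for later on), a problem statement and hypothesis might read:

    Problem: too many users unsubscribe from our marketing emails, shrinking the audience our campaigns can reach.
    Hypothesis: a model trained on user attributes and engagement history can predict which users are likely to unsubscribe, so we can adjust what we send them before they do.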


💾 Prepare data

Collect data from various sources, generate additional training data if needed, and perform feature engineering to transform the raw data into a set of useful input features.

  1. The pipeline core_data_users_v0 contains 3 tables that are joined together (a join sketch follows the demo below).

  2. Pipeline prepare_data is used in multiple other pipelines to perform data preparation on input datasets.

    For example, the ml_training pipeline that’s responsible for training an ML model will first run the above 2 pipelines to build the training set that’s used to train and test the model.

Collecting and combining core user data

Demo video: CleanShot.2024-04-14.at.09.00.50.mp4
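The actual tables and join keys live in the core_data_users_v0 pipeline itself; as a rough sketch only, joining three user-related tables inside a Mage transformer block could look like the following (the table and column names here are hypothetical placeholders, not the ones used in the repo):

    import pandas as pd

    if 'transformer' not in globals():
        from mage_ai.data_preparation.decorators import transformer


    @transformer
    def join_user_tables(users: pd.DataFrame, profiles: pd.DataFrame, subscriptions: pd.DataFrame, **kwargs) -> pd.DataFrame:
        # Each upstream block's output arrives as a positional argument.
        # 'user_id' and the table names are hypothetical placeholders.
        df = users.merge(profiles, on='user_id', how='left')
        df = df.merge(subscriptions, on='user_id', how='left')
        return df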

Feature engineering

Demo video: CleanShot.2024-04-14.at.08.38.15.mp4
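Feature engineering in Mage typically happens in transformer blocks as well. The block below is a minimal, hypothetical sketch (the columns and derived features are placeholders, not the ones computed by the prepare_data pipeline):

    import pandas as pd

    if 'transformer' not in globals():
        from mage_ai.data_preparation.decorators import transformer


    @transformer
    def build_features(df: pd.DataFrame, **kwargs) -> pd.DataFrame:
        # Hypothetical engineered features on the joined user data.
        df['account_age_days'] = (
            pd.Timestamp.now(tz='UTC') - pd.to_datetime(df['signup_date'], utc=True)
        ).dt.days
        df['email_open_rate'] = df['emails_opened'] / df['emails_sent'].clip(lower=1)
        # One-hot encode a low-cardinality categorical column.
        df = pd.get_dummies(df, columns=['acquisition_channel'], dummy_na=True)
        return df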


🦾 Train and evaluate

Use the training data to teach the machine learning model to make accurate predictions. Evaluate the trained model's performance on a test set.

  1. The ml_training pipeline takes in a training set and trains an XGBoost classifier to predict in what scenarios a user would unsubscribe from a marketing email.

  2. This pipeline will also evaluate the model’s performance on a test data set. It’ll provide visualizations and explain which features are important using SHAP values.

  3. Finally, this pipeline will serialize the model and its weights to disk to be used during the inference phase (a rough sketch of these steps follows the demo below).

Demo video: CleanShot.2024-04-14.at.08.47.59.mp4
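The real training code lives in the ml_training pipeline; the sketch below only illustrates the general shape of the three steps described above (train an XGBoost classifier, evaluate and explain it with SHAP, serialize it to disk). The synthetic data, hyperparameters, and output path are assumptions, not values from the repo:

    import joblib
    import shap
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Stand-in for the training set built by the data preparation pipelines.
    X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the classifier.
    model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
    model.fit(X_train, y_train)

    # Evaluate on the held-out test set.
    print('accuracy:', accuracy_score(y_test, model.predict(X_test)))
    print('roc_auc:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

    # Explain which features matter using SHAP values.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test)

    # Serialize the model for the inference pipelines (placeholder path).
    joblib.dump(model, 'unsubscribe_classifier.joblib')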


🤖 Deploy and integrate

Deploy the trained model to a production environment to generate predictions on new data, either in real-time via an API or in batch pipelines. Integrate the model's predictions with other business applications.

  1. Once the model has finished training and has been packaged for deployment, we’ll need to set up a feature store that serves user features on demand before we can use the model to make predictions.

  2. Use the ml_feature_fetching pipeline to prepare the features for each user ahead of time before progressing to the inference phase.

  3. The ml_inference_offline pipeline is responsible for making batch predictions offline on the entire set of users (see the sketch after the demo below).

  4. The ml_inference_online pipeline serves real-time model predictions and can be interacted with via an API request. Use the ML playground to interact with this model and make online predictions.

Feature store and fetching

Demo video: CleanShot.2024-04-14.at.08.55.15.mp4

Batch offline predictions

Demo video: CleanShot.2024-04-14.at.09.09.10.mp4
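Conceptually, the offline inference step loads the serialized model and scores a feature row for every user in one pass. A minimal sketch, assuming the placeholder model file from the training sketch above and a hypothetical parquet file of precomputed user features:

    import joblib
    import pandas as pd

    # Load the model serialized by the training pipeline (placeholder path).
    model = joblib.load('unsubscribe_classifier.joblib')

    # Precomputed features for the entire set of users (hypothetical source).
    features = pd.read_parquet('user_features.parquet')
    X = features.drop(columns=['user_id'])

    # Score every user and store the probability of unsubscribing.
    scores = pd.DataFrame({
        'user_id': features['user_id'],
        'unsubscribe_probability': model.predict_proba(X)[:, 1],
    })
    scores.to_parquet('unsubscribe_scores.parquet', index=False)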

Real-time online predictions

  1. The pipeline used for online inference is called ml_inference_online.

  2. Before interacting with the online predictions pipeline, you must first create an API trigger for the ml_inference_online pipeline; you can follow the general instructions to create an API trigger. A sketch of such a request is shown after the demo below.

  3. The video below is for the pipeline named ml_playground, which contains no-code UI interactions to make it easy to play around with the online predictions.

Demo video: CleanShot.2024-04-14.at.12.52.54.mp4
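When you create the API trigger, Mage shows the exact endpoint URL and token to use; the request below is only an illustration with placeholder values (copy the real URL, token, and variable names from the trigger’s page in the Mage UI):

    import requests

    # Placeholder trigger URL and token; replace with the ones Mage generates
    # for the API trigger on the ml_inference_online pipeline.
    url = 'http://localhost:6789/api/pipeline_schedules/1/pipeline_runs/your_trigger_token'

    response = requests.post(
        url,
        json={
            'pipeline_run': {
                'variables': {
                    # Hypothetical input for a single online prediction.
                    'user_id': 123,
                },
            },
        },
        timeout=30,
    )
    print(response.status_code, response.json())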


🔭 Observe

Monitor the deployed model's prediction performance, latency, and system health in the production environment.

Example coming soon.


🧪 Experiment

Conduct controlled experiments like A/B tests to measure the impact of the model's predictions on business metrics. Compare the new model's performance to a control model or previous model versions.

Example coming soon.


🏋️ Retrain

Continuously gather new training data and retrain the model periodically to maintain and improve prediction performance.

  1. Every 2 hours, the retraining pipeline named ml_retraining_model will run.

  2. The retraining pipeline triggers the ml_training pipeline if the following contrived condition is met (a sketch of this check appears after the demo below):

    The number of partitions created for the core_data.users_v0 data product is divisible by 4.

Demo video: CleanShot.2024-04-14.at.13.25.12.mp4
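How the partition count is looked up is specific to the repo’s retraining pipeline; the snippet below only sketches the gating check itself, with get_partition_count standing in as a hypothetical helper:

    def should_retrain(partition_count: int) -> bool:
        # Contrived condition from above: the number of partitions created for
        # the core_data.users_v0 data product is divisible by 4.
        return partition_count > 0 and partition_count % 4 == 0


    # Hypothetical usage; get_partition_count is a placeholder, not a Mage API.
    # if should_retrain(get_partition_count('core_data.users_v0')):
    #     ...trigger the ml_training pipeline...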


Conclusion
