This project intends to gather reference material and to give practical hints on how to reproduce an all-in-one modern data stack (MDS) locally.
Among other use cases, one may think of training, onboarding newcomers, or testing/benchmarking new components (e.g., Delta Lake vs. Iceberg vs. Hudi, or LakeFS).
Even though the members of the GitHub organization may be employed by some companies, they speak on their own behalf and do not represent those companies.
- Architecture principles for data engineering pipelines on the Modern Data Stack (MDS)
- Specifications/principles for a data engineering pipeline deployment tool, `dpcctl`, the Data Processing Pipeline (DPP) CLI utility, a Minimum Viable Product (MVP) in Go
- Material for the Data platform - Data-lakes, data warehouses, data lake-houses
- Material for the Data platform - Data life cycle
- Material for the Data platform - Data contracts
- Material for the Data platform - Metadata
- Material for the Data platform - Data quality
- Material for the Data platform - Cheat sheets
- Title: The SwirlAI data engineering project
- Author: Aurimas Griciūnas (Aurimas Griciūnas on LinkedIn, Aurimas Griciūnas on Substack)
- Date: July 2023
- Link on Substack: https://www.newsletter.swirlai.com/p/the-swirlai-data-engineering-project
- Reference: https://lakefs.io/blog/the-docker-everything-bagel-spin-up-a-local-data-stack/
- See also GitHub - Data Engineering Helpers - Knowledge Sharing - LakeFS
- Reference: https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html
- See also GitHub - Data Engineering Helpers - Knowledge Sharing - DuckDB
- DuckDB home page: https://duckdb.org/
- Why DuckDB: https://duckdb.org/why_duckdb
- DuckDB project on GitHub: https://github.com/duckdb/duckdb
- Article: https://medium.com/marvelous-mlops/building-an-end-to-end-mlops-project-with-databricks-8cd9a85cc3c0
- Author: Benito Martin (Benito Martin on LinkedIn, Benito Martin on Medium)
This blog post details a capstone project using Databricks for MLOps. It covers the end-to-end process of deploying a machine learning model, from data preprocessing and feature engineering to model monitoring and continuous integration/continuous deployment (CI/CD). Key learnings include:
- Databricks for MLOps: Using Databricks for data preprocessing, feature engineering, model training, and deployment.
- Feature Store: Leveraging Databricks Feature Store for consistent feature computation.
- MLflow Tracking: Tracking experiments, logging parameters and metrics, and ensuring reproducibility.
- Model Serving: Exploring different model serving architectures for efficient deployment.
- A/B Testing: Implementing A/B testing for model comparison and performance-based routing.
- Databricks Asset Bundles: Managing projects with Infrastructure-as-Code (IaC) principles.
- Monitoring and Drift Detection: Setting up model monitoring, tracking metrics, and detecting drift.
- CI/CD: Implementing CI/CD workflows for continuous model validation and deployment.
- Scalability: Scaling models for production and real-time serving.
- Post on LinkedIn: https://www.linkedin.com/posts/pau-labarta-bajo-4432074b_machinelearning-mlops-realworldml-ugcPost-7265607334830256128-ZkrV/
- Authors: Pau Labarta Bajo and Javier Yanzon (Pau Labarta Bajo on LinkedIn, Javier Yanzon on LinkedIn)
- GitHub repository: https://github.com/javieryanzon/bike_sharing_demand_predictor