This project intends to gather reference material and to give practical hints on how to reproduce an all-in-one modern data stack (MDS) locally.
Among other use cases, one may think of training, onboarding newcomers, or testing/benchmarking new components (e.g., Delta Lake vs. Iceberg vs. Hudi, or LakeFS).
Even though the members of the GitHub organization may be employed by some companies, they speak on their own behalf and do not represent those companies.
- Architecture principles for data engineering pipelines on the Modern Data Stack (MDS)
- Specifications/principles for a data engineering pipeline deployment tool, `dpcctl`, the Data Processing Pipeline (DPP) CLI utility, a Minimum Viable Product (MVP) in Go
- Material for the Data platform - Data-lakes, data warehouses, data lake-houses
- Material for the Data platform - Data life cycle
- Material for the Data platform - Data contracts
- Material for the Data platform - Metadata
- Material for the Data platform - Data quality
- Material for the Data platform - Cheat sheets
- Title: The SwirlAI data engineering project
- Author: Aurimas Griciūnas (Aurimas Griciūnas on LinkedIn, Aurimas Griciūnas on Substack)
- Date: July 2023
- Link on Substack: https://www.newsletter.swirlai.com/p/the-swirlai-data-engineering-project
- Reference: https://lakefs.io/blog/the-docker-everything-bagel-spin-up-a-local-data-stack/
- See also GitHub - Data Engineering Helpers - Knowledge Sharing - LakeFS
- Reference: https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html
- See also GitHub - Data Engineering Helpers - Knowledge Sharing - DuckDB
- DuckDB home page: https://duckdb.org/
- Why DuckDB: https://duckdb.org/why_duckdb
- DuckDB project on GitHub: https://github.com/duckdb/duckdb
- Article: https://medium.com/marvelous-mlops/building-an-end-to-end-mlops-project-with-databricks-8cd9a85cc3c0
- Author: Benito Martin (Benito Martin on LinkedIn, Benito Martin on Medium)
This blog post details a capstone project using Databricks for MLOps. It covers the end-to-end process of deploying a machine learning model, from data preprocessing and feature engineering to model monitoring and continuous integration/continuous deployment (CI/CD). Key learnings include:
- Databricks for MLOps: Using Databricks for data preprocessing, feature engineering, model training, and deployment.
- Feature Store: Leveraging Databricks Feature Store for consistent feature computation.
- MLflow Tracking: Tracking experiments, logging parameters and metrics, and ensuring reproducibility.
- Model Serving: Exploring different model serving architectures for efficient deployment.
- A/B Testing: Implementing A/B testing for model comparison and performance-based routing.
- Databricks Asset Bundles: Managing projects with Infrastructure-as-Code (IaC) principles.
- Monitoring and Drift Detection: Setting up model monitoring, tracking metrics, and detecting drift.
- CI/CD: Implementing CI/CD workflows for continuous model validation and deployment.
- Scalability: Scaling models for production and real-time serving.
- Post on LinkedIn: https://www.linkedin.com/posts/pau-labarta-bajo-4432074b_machinelearning-mlops-realworldml-ugcPost-7265607334830256128-ZkrV/
- Authors: Pau Labarta Bajo and Javier Yanzon (Pau Labarta Bajo on LinkedIn, Javier Yanzon on LinkedIn)
- GitHub repository: https://github.com/javieryanzon/bike_sharing_demand_predictor