This project demonstrates how to set up a local data lake for extracting, transforming, and loading data, then querying it with a SQL engine. It showcases how several technologies integrate into a flexible, powerful data processing environment.

## 🛠️ Technology Stack
- 🐳 Docker: For containerization and easy setup of services
- 🧙‍♂️ Mage: Data pipeline orchestrator
- ✨ Apache Spark: For data processing and transformations
- 🪣 MinIO: S3-compatible object store serving as our data lake
- 🧊 Apache Iceberg: Advanced table format for data lakes
- 🌟 StarRocks: High-performance analytical database for querying data
## 📋 Prerequisites

- 🐳 Docker and Docker Compose
- 🧠 Basic understanding of data engineering concepts
- 🐍 Familiarity with Python and SQL
## 🚀 Getting Started

- Clone the repository
- Create a `.env` file with MinIO credentials (`MINIO_ACCESS_KEY=choose_a_key`, `MINIO_SECRET_KEY=choose_a_secret`)
- Build the Docker image: `make build`
- Start the services: `make up`
- Access the Mage UI: `make browse`
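Once the services are up, you can optionally confirm that MinIO is reachable with your credentials. This sketch is not part of the project; it assumes the `boto3` package is installed and MinIO is listening on its default API port 9000:

```python
import os

import boto3

# Connect to the local MinIO endpoint with the credentials from .env
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)

# Listing buckets succeeds only if the endpoint and credentials are valid
print(s3.list_buckets()["Buckets"])
```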
## 📁 Project Structure

- 📄 `Dockerfile`: Configures the Mage environment with Spark
- 🐳 `docker-compose.yaml`: Defines services (Mage, StarRocks, MinIO)
- 🛠️ `Makefile`: Simplifies common commands
- 📋 `requirements.txt`: Python dependencies
- 📁 `mage_demo/`: Main project directory
- ⚙️ `spark-config/`: JAR files for Spark configuration
- 🔧 `utils/`: Utility scripts (e.g., a Spark session factory; see the sketch below)
- 📊 `data/`: Directory for storing sample data
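A Spark session factory for this kind of stack typically wires Spark to MinIO over S3A and registers an Iceberg catalog. The following is a minimal sketch, not the project's actual implementation: the catalog name `iceberg`, the `warehouse` bucket, the `minio:9000` host, and the Hadoop-type catalog (which stores table metadata directly in object storage, needing no extra catalog service) are all assumptions.

```python
import os

from pyspark.sql import SparkSession


def get_spark_session(app_name: str = "mage_demo") -> SparkSession:
    """Return a SparkSession wired to MinIO (S3A) with an Iceberg catalog."""
    return (
        SparkSession.builder.appName(app_name)
        # Enable Iceberg's SQL extensions and register a catalog named "iceberg"
        .config(
            "spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        )
        .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
        # Hadoop-type catalog: table metadata lives next to the data in MinIO
        .config("spark.sql.catalog.iceberg.type", "hadoop")
        .config("spark.sql.catalog.iceberg.warehouse", "s3a://warehouse/")
        # Point S3A at the MinIO container, using the credentials from .env
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", os.environ["MINIO_ACCESS_KEY"])
        .config("spark.hadoop.fs.s3a.secret.key", os.environ["MINIO_SECRET_KEY"])
        # MinIO buckets are addressed by path, not virtual-hosted names
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )
```

With such a session, a pipeline step could load a CSV and write it to an Iceberg table (table name here is hypothetical):

```python
spark = get_spark_session()
df = spark.read.csv("mage_demo/data/listings.csv", header=True)
df.writeTo("iceberg.demo.listings").createOrReplace()
```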
## 💡 Usage

- Place your CSV data files in the `mage_demo/data/` directory
- Run the Mage pipeline to process the data and store it in MinIO in Apache Iceberg format
- Use StarRocks to query the data:
  - Create an external catalog (sketched below)
  - Set the Iceberg catalog as the active catalog
  - Use SQL to query and analyze the data
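Creating the external catalog might look like the following. This is a sketch, not the project's exact DDL: the catalog name, REST endpoint, and credentials are placeholders, and the `iceberg.catalog.type` value must match how the Iceberg catalog is actually exposed (REST, Hive metastore, etc.) in your setup.

```sql
-- Sketch only: names, endpoint, and catalog type are assumptions.
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "http://iceberg-rest:8181",
    "aws.s3.endpoint" = "http://minio:9000",
    "aws.s3.access_key" = "choose_a_key",
    "aws.s3.secret_key" = "choose_a_secret",
    "aws.s3.enable_path_style_access" = "true"
);

-- Make it the active catalog for subsequent queries
SET CATALOG iceberg_catalog;
```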
For example, counting listings by neighbourhood where the review score is 5.0:

```sql
SELECT
    neighbourhood,
    COUNT(*) AS no_reviews
FROM listings
WHERE reviews = '5.0'
GROUP BY 1
ORDER BY COUNT(*) DESC;
```
## 🔮 Future Improvements

- 🔍 Add more complex data transformations in Mage pipelines
- 🔄 Implement Delta Lake alongside Apache Iceberg
- 🔬 Explore advanced features of StarRocks for data analysis
## 🤝 Contributing

Contributions to improve the project are welcome. Please follow the standard fork-and-pull-request workflow.
## 📜 License

This project is licensed under the MIT License; see the LICENSE file for details.
## 🙏 Acknowledgments

This project was inspired by the need for a simple, local data lake setup for learning and development purposes.