This project demonstrates how to set up a local data lake for extracting, transforming, and loading data, then querying it with a SQL engine. It showcases how several technologies integrate into a flexible, powerful data processing environment.

## 🛠️ Technology Stack
- 🐳 Docker: For containerization and easy setup of services
- 🧙‍♂️ Mage: Data pipeline orchestrator
- ✨ Apache Spark: For data processing and transformations
- 🪣 MinIO: S3-compatible object store serving as our data lake
- 🧊 Apache Iceberg: Advanced table format for data lakes
- 🌟 StarRocks: High-performance analytical database for querying data
## 📋 Prerequisites

- 🐳 Docker and Docker Compose
- 🧠 Basic understanding of data engineering concepts
- 🐍 Familiarity with Python and SQL
## 🚀 Getting Started

- Clone the repository
- Create a `.env` file with MinIO credentials (`MINIO_ACCESS_KEY=choose_a_key`, `MINIO_SECRET_KEY=choose_a_secret`)
- Build the Docker image: `make build`
- Start the services: `make up`
- Access the Mage UI: `make browse`
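Once the services are up, you can optionally confirm that MinIO is reachable with your credentials. This sketch is not part of the project; it assumes the `boto3` package is installed and MinIO is listening on its default API port 9000:

```python
import os

import boto3

# Connect to the local MinIO endpoint with the credentials from .env
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)

# Listing buckets succeeds only if the endpoint and credentials are valid
print(s3.list_buckets()["Buckets"])
```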
## 📁 Project Structure

- 📄 `Dockerfile`: Configures the Mage environment with Spark
- 🐳 `docker-compose.yaml`: Defines services (Mage, StarRocks, MinIO)
- 🛠️ `Makefile`: Simplifies common commands
- 📋 `requirements.txt`: Python dependencies
- 📁 `mage_demo/`: Main project directory
- ⚙️ `spark-config/`: JAR files for Spark configuration
- 🔧 `utils/`: Utility scripts (e.g., a Spark session factory; see the sketch below)
- 📊 `data/`: Directory for storing sample data
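A Spark session factory for this kind of stack typically wires Spark to MinIO over S3A and registers an Iceberg catalog. The following is a minimal sketch, not the project's actual implementation: the catalog name `iceberg`, the `warehouse` bucket, the `minio:9000` host, and the Hadoop-type catalog (which stores table metadata directly in object storage, needing no extra catalog service) are all assumptions.

```python
import os

from pyspark.sql import SparkSession


def get_spark_session(app_name: str = "mage_demo") -> SparkSession:
    """Return a SparkSession wired to MinIO (S3A) with an Iceberg catalog."""
    return (
        SparkSession.builder.appName(app_name)
        # Enable Iceberg's SQL extensions and register a catalog named "iceberg"
        .config(
            "spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        )
        .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
        # Hadoop-type catalog: table metadata lives next to the data in MinIO
        .config("spark.sql.catalog.iceberg.type", "hadoop")
        .config("spark.sql.catalog.iceberg.warehouse", "s3a://warehouse/")
        # Point S3A at the MinIO container, using the credentials from .env
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", os.environ["MINIO_ACCESS_KEY"])
        .config("spark.hadoop.fs.s3a.secret.key", os.environ["MINIO_SECRET_KEY"])
        # MinIO buckets are addressed by path, not virtual-hosted names
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )
```

With such a session, a pipeline step could load a CSV and write it to an Iceberg table (table name here is hypothetical):

```python
spark = get_spark_session()
df = spark.read.csv("mage_demo/data/listings.csv", header=True)
df.writeTo("iceberg.demo.listings").createOrReplace()
```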
## 💡 Usage

- Place your CSV data files in the `mage_demo/data/` directory
- Run the Mage pipeline to process the data and store it in MinIO in Apache Iceberg format
- Use StarRocks to query the data:
  - Create an external catalog (sketched below)
  - Set the Iceberg catalog as the active catalog
  - Use SQL to query and analyze the data
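Creating the external catalog might look like the following. This is a sketch, not the project's exact DDL: the catalog name, REST endpoint, and credentials are placeholders, and the `iceberg.catalog.type` value must match how the Iceberg catalog is actually exposed (REST, Hive metastore, etc.) in your setup.

```sql
-- Sketch only: names, endpoint, and catalog type are assumptions.
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "http://iceberg-rest:8181",
    "aws.s3.endpoint" = "http://minio:9000",
    "aws.s3.access_key" = "choose_a_key",
    "aws.s3.secret_key" = "choose_a_secret",
    "aws.s3.enable_path_style_access" = "true"
);

-- Make it the active catalog for subsequent queries
SET CATALOG iceberg_catalog;
```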
For example, counting listings by neighbourhood where the review score is 5.0:

```sql
SELECT
    neighbourhood,
    COUNT(*) AS no_reviews
FROM listings
WHERE reviews = '5.0'
GROUP BY 1
ORDER BY COUNT(*) DESC;
```
## 🔮 Future Improvements

- 🔍 Add more complex data transformations in Mage pipelines
- 🔄 Implement Delta Lake alongside Apache Iceberg
- 🔬 Explore advanced features of StarRocks for data analysis
## 🤝 Contributing

Contributions to improve the project are welcome. Please follow the standard fork-and-pull-request workflow.
## 📜 License

This project is licensed under the MIT License; see the LICENSE file for details.
## 🙏 Acknowledgments

This project was inspired by the need for a simple, local data lake setup for learning and development purposes.