As data volumes grow, processing and analyzing data in real time has become a practical requirement rather than a luxury. This project implements a real-time e-commerce analytics pipeline: Kafka streams the data, Spark processes it, Elasticsearch stores it, and Kibana visualizes it, with the services running in Docker containers. Python is the primary scripting language that ties the pipeline together.
- Data Generation: Simulates real-time e-commerce data using the Faker library.
- Kafka Integration: Streams generated data into a Kafka cluster efficiently.
- Spark Processing: Applies real-time processing and analytics with Apache Spark.
- Elasticsearch Storage: Indexes and stores processed data for efficient retrieval.
- Kibana Visualization: Provides insightful visualizations of e-commerce metrics.
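
The data generation and Kafka steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `data_generation.py`: the event fields, the `ecommerce_events` topic name, and the helpers `make_event`, `publish`, and `InMemoryProducer` are all hypothetical, and the standard library's `random` and `uuid` stand in for Faker so the sketch runs without extra packages. `publish` accepts any object with a `send(topic, value)` method, which is the shape of kafka-python's `KafkaProducer.send`.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Hypothetical values; the real project may use different fields and names.
CATEGORIES = ["books", "electronics", "clothing", "toys"]
TOPIC = "ecommerce_events"


def make_event(rng):
    """Build one simulated purchase event (stdlib stand-in for Faker)."""
    return {
        "order_id": str(uuid.UUID(int=rng.getrandbits(128))),
        "customer_id": rng.randint(1, 10_000),
        "category": rng.choice(CATEGORIES),
        "quantity": rng.randint(1, 5),
        "unit_price": round(rng.uniform(1.0, 500.0), 2),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def publish(producer, topic, events):
    """Send each event as UTF-8 JSON bytes; return the number sent."""
    count = 0
    for event in events:
        producer.send(topic, json.dumps(event).encode("utf-8"))
        count += 1
    return count


class InMemoryProducer:
    """Test double that records messages instead of contacting Kafka."""

    def __init__(self):
        self.messages = []

    def send(self, topic, value):
        self.messages.append((topic, value))


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so runs are repeatable
    producer = InMemoryProducer()
    sent = publish(producer, TOPIC, (make_event(rng) for _ in range(3)))
    print(f"sent {sent} events to {TOPIC}")
```

Against a real broker, `InMemoryProducer()` could be swapped for kafka-python's `KafkaProducer(bootstrap_servers=...)`, with the broker address taken from the project's Docker configuration.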
- Docker and Docker Compose installed.
- Python 3.x.
- Access to an Elasticsearch instance.
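
A quick way to check part of this list before going further is a small script like the one below. It is only a sketch: `check_prerequisites` is a hypothetical helper, it confirms the interpreter is Python 3 and that a `docker` executable is on the `PATH`, and it does not verify Docker Compose or Elasticsearch access.

```python
import shutil
import sys


def check_prerequisites(required_tools=("docker",)):
    """Return a mapping of prerequisite name -> whether it looks available."""
    results = {"python3": sys.version_info.major == 3}
    for tool in required_tools:
        # shutil.which returns the executable's path, or None if not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```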
- Clone the Repository

      git clone https://github.com/simardeep1792/Real-Time-E-Commerce-Analytics-Pipeline
      cd Real-Time-E-Commerce-Analytics-Pipeline

- Docker Setup

      docker --version

- Starting the Services

      chmod +x setup_cluster.sh
      ./setup_cluster.sh

This script initializes services like Kafka, Spark, Elasticsearch, and others in Docker containers.
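
Containers can take a while to accept connections after the script returns, so it may help to wait for them before generating data. The sketch below polls TCP ports until they open. The helpers `port_is_open` and `wait_for_services` are hypothetical, and the ports in the example (9092 for Kafka, 9200 for Elasticsearch) are the upstream defaults, assumed here; the ports this project actually maps may differ.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(services, attempts=30, delay=2.0):
    """Poll until every service accepts connections; return names still down."""
    pending = dict(services)
    for attempt in range(attempts):
        # Keep only the services whose port is still closed.
        pending = {n: a for n, a in pending.items() if not port_is_open(*a)}
        if not pending:
            break
        if attempt < attempts - 1:
            time.sleep(delay)
    return sorted(pending)


if __name__ == "__main__":
    # Assumed ports: upstream defaults, not confirmed for this project.
    services = {
        "kafka": ("localhost", 9092),
        "elasticsearch": ("localhost", 9200),
    }
    down = wait_for_services(services, attempts=2, delay=1.0)
    print("all services reachable" if not down else f"not reachable: {down}")
```

An open port only shows that something is listening; it does not prove the service has finished starting up.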
- Data Generation: Run data_generation.py to start generating and streaming data.
- Kibana for Visualization: Access Kibana at http://localhost:[Kibana-port] to visualize the data.
Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.