This repository contains the code and resources for an end-to-end Housing Price Prediction MLOps project. It combines tools for developing, deploying, and monitoring a machine learning model that predicts housing prices.
- Introduction
- Technologies Used
- Project Structure
- Getting Started
- Problem Description
- Dataset Description
- Modeling
- MLOps Pipeline
- Workflow Orchestration
- Monitoring and Visualization
- Contributing
- License
This project aims to predict Nigerian housing prices using a machine learning model. It incorporates various DevOps and MLOps practices to ensure streamlined development, deployment, and monitoring of the model.
- Visual Studio Code: A powerful and versatile code editor with built-in debugging, version control, and an extensive extension ecosystem.
- Jupyter Notebook: An interactive, web-based environment for data analysis and scientific computing that supports code, visualizations, and narrative text.
- PostgreSQL: A powerful open-source relational database management system known for its extensibility, reliability, and advanced features.
- Python: A widely-used high-level programming language known for its simplicity and readability, commonly used for data manipulation and machine learning.
- Pandas: A Python library for data manipulation and analysis, providing data structures and functions to efficiently work with structured data.
- Matplotlib: A comprehensive data visualization library in Python, used to create static, interactive, and animated visualizations.
- scikit-learn: A machine learning library for Python that provides simple and efficient tools for data mining and data analysis.
- Flask: A lightweight web application framework in Python, suitable for building web applications and APIs.
- MLflow: An open-source platform for managing the end-to-end machine learning lifecycle, including experiment tracking, reproducibility, and deployment.
- Docker: A platform for developing, shipping, and running applications in containers, ensuring consistency across various environments.
- Anaconda: A distribution of Python and R programming languages for data science and machine learning, providing a variety of packages and tools.
- Linux: An open-source operating system kernel widely used for server environments, development, and hosting.
- Amazon Web Services (AWS): A cloud computing platform offering a wide range of services for computing power, storage, and other functionalities.
- Grafana: A monitoring and visualization tool used to track metrics, create dashboards, and gain insights from data.
- Git: A distributed version control system used for tracking changes in code and collaborating with others.
- Linter for ensuring code quality and adherence to coding standards.
The project has been structured with the following folders and files:
- .github: contains the CI/CD files (GitHub Actions)
- config: contains Grafana config files
- dashboards: contains the monitoring dashboards in JSON format
- data: dataset and test sample for testing the model
- model: full pipeline from preprocessing to prediction and monitoring using MLflow, Prefect, Grafana, Adminer, and docker-compose
- notebooks: EDA and modeling performed at the beginning of the project to establish a baseline
- tests: unit tests
- pyproject.toml: linting and formatting
- requirements.txt: project requirements
- Clone the repository:
git clone https://github.com/yourusername/your-repo.git
- Set up your environment and install dependencies:
pip install -r requirements.txt
- Follow the instructions in the relevant sections below to run preprocessing, training, and deployment.
The goal of this project is to develop a machine learning model that can accurately predict housing prices in Nigeria for various types of houses across all 36 states. The model aims to take into account features such as the type of house, location, bedroom size, bathroom size, parking space size, and other relevant factors to make accurate predictions. This predictive model can be a valuable tool for real estate professionals, homeowners, and potential buyers to estimate property values.
The dataset used for this project is a Nigeria House Price Dataset that covers all 36 states of the country. It is located in the data/ directory and contains the following columns:
- ID: A unique identifier for each property.
- Type of House: The type of the house, such as apartment, duplex, bungalow, etc.
- Location: The location of the property within a specific state.
- Bedroom Size: The number of bedrooms in the house.
- Bathroom Size: The number of bathrooms in the house.
- Parking Space Size: The size of the parking space available.
- Price: The target variable, representing the price of the property.
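A quick schema check can catch column mismatches before preprocessing starts. The sketch below is only an illustration: it assumes the CSV headers match the column names listed above verbatim and that the size and price columns are numeric, neither of which is confirmed here. The check_schema helper is a hypothetical name and is not part of the repository.

```python
import pandas as pd

# Assumed header names, copied from the column list above; the real CSV
# may spell them differently (for example, in snake_case).
EXPECTED_COLUMNS = [
    "ID",
    "Type of House",
    "Location",
    "Bedroom Size",
    "Bathroom Size",
    "Parking Space Size",
    "Price",
]
# Assumption: these columns hold numbers in the raw file.
NUMERIC_COLUMNS = ["Bedroom Size", "Bathroom Size", "Parking Space Size", "Price"]


def check_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Verify the expected columns exist and coerce the numeric ones."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    checked = df.copy()
    for col in NUMERIC_COLUMNS:
        # Entries that cannot be parsed become NaN so they can be
        # inspected or dropped during preprocessing.
        checked[col] = pd.to_numeric(checked[col], errors="coerce")
    return checked
```

Passing the result of pd.read_csv on the file in data/ to check_schema returns a copy with numeric columns coerced, or raises ValueError if a listed column is absent.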
The model training process is defined in the model_train.py file. It involves loading the preprocessed data, splitting it into training and validation sets, training a machine learning model, and saving the model to the model registry. The trained model is saved in the models/ directory.
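The split, train, and validate steps described above could look roughly like the following. This is a minimal sketch, not the contents of model_train.py: the choice of RandomForestRegressor, the one-hot encoding, the column names, and the train_and_validate function are all assumptions made for illustration, and the step that saves the model to the registry is left out.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed column names, taken from the dataset description.
CATEGORICAL = ["Type of House", "Location"]
NUMERIC = ["Bedroom Size", "Bathroom Size", "Parking Space Size"]
TARGET = "Price"


def train_and_validate(df: pd.DataFrame, test_size: float = 0.2, random_state: int = 42):
    """Split the data, fit a regression pipeline, and report validation MAE."""
    X = df[CATEGORICAL + NUMERIC]
    y = df[TARGET]
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # One-hot encode the categorical columns; numeric columns pass through.
    # Categories unseen during training are encoded as all zeros.
    preprocess = ColumnTransformer(
        [("categorical", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough",
    )
    model = Pipeline(
        [
            ("preprocess", preprocess),
            # Hypothetical model choice; the project may use a different one.
            ("regressor", RandomForestRegressor(n_estimators=100, random_state=random_state)),
        ]
    )
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    return model, mae
```

Keeping the encoder inside the pipeline means the fitted object accepts raw feature rows at prediction time, which simplifies serving the model later.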
- Data preprocessing and feature engineering.
- Model training and evaluation.
- Model versioning using MLflow.
- Continuous Integration (CI) using GitHub Actions for code quality checks and tests.
- Continuous Deployment (CD) using GitHub Actions to deploy the model in a containerized environment.
- Workflow orchestration using Prefect to schedule and manage the entire pipeline.
Workflow orchestration automates the steps of the machine learning pipeline with Prefect Cloud, which schedules, coordinates, and monitors data preprocessing, model training, and deployment.
Visit Prefect Cloud to set up Prefect Cloud, then build and apply the deployment:
```
prefect deployment build main.py:main_run \
-n "main_pipeline" \
-o "main_pipeline" \
--apply
```
Grafana is used to monitor metrics related to the model's performance and data quality. Its dashboards visualize key performance indicators in real time and help identify anomalies and trends.
Contributions are welcome! If you would like to contribute to the project, please follow the standard GitHub workflow: fork the repository, create a feature branch, make your changes, and submit a pull request.
This project is licensed under the MIT License.