The application is designed to analyze the stock price of a share by utilizing datasets from the past 20 years together with live stock market data.
The link to the application: App
The link to the video describing the repository: Video
This repository contains a Flask App, which is also hosted on Azure for public access. The project makes use of Delta Tables from Databricks, a Docker image hosted on DockerHub, and Azure Web App for deployment. Further, the project uses various APIs to interact with external websites and fetch the data used for predictions.
The App can be run by the user locally or can be accessed via the link provided earlier.
The following image represents the architectural diagram of our project:
This is the landing page of our application, which takes two inputs:
- The stock that needs to be predicted.
- The email where the prediction needs to be sent.
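As a rough illustration of this two-input flow, the minimal Flask sketch below shows how the form could be handled; the field names `ticker` and `email` and the helpers `predict_stock` / `send_prediction_email` are hypothetical placeholders, not the actual code in app.py:

```python
# Minimal sketch of the landing-page flow; predict_stock and
# send_prediction_email are hypothetical placeholders, not the real helpers.
from flask import Flask, render_template, request

app = Flask(__name__)

def predict_stock(ticker: str) -> float:
    """Placeholder for the real prediction logic."""
    return 0.0

def send_prediction_email(address: str, ticker: str, prediction: float) -> None:
    """Placeholder for the real email-sending logic."""
    pass

@app.route("/", methods=["GET", "POST"])
def landing_page():
    if request.method == "POST":
        ticker = request.form["ticker"]  # the stock to be predicted
        email = request.form["email"]    # where the prediction should be sent
        send_prediction_email(email, ticker, predict_stock(ticker))
        return render_template("email_sent.html")
    return render_template("stock_prediction.html")

if __name__ == "__main__":
    app.run()
```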
The following are the functions of our application:
- Information and predictions of the stock price
- Latest articles to make an informed decision before buying a stock
- Historical data trends shown in a candlestick chart for visual analysis of the stock (see the sketch after this list)
- Personalized prediction email
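For the candlestick view, a minimal sketch of how historical data can be turned into such a chart is shown below; it assumes `yfinance` and `plotly` purely for illustration, which may differ from the libraries actually used in the app:

```python
# Candlestick-chart sketch; yfinance and plotly are illustrative assumptions
# and may not match the libraries used in the actual application.
import yfinance as yf
import plotly.graph_objects as go

def candlestick_figure(ticker: str, period: str = "1y") -> go.Figure:
    history = yf.Ticker(ticker).history(period=period)  # historical OHLC data
    return go.Figure(
        data=[go.Candlestick(
            x=history.index,
            open=history["Open"],
            high=history["High"],
            low=history["Low"],
            close=history["Close"],
        )]
    )

if __name__ == "__main__":
    candlestick_figure("AAPL").show()
```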
A. Online: Visit the link provided earlier; no additional steps are required.
Note: The web app may be unavailable if the Azure service is shut down due to exhaustion of credits; in that case, please use the offline method described below.
B. Offline:
- Clone this repository to your local machine.
- Run `make install` to install all the required packages and libraries.
- Enter the following command into the terminal: `python app.py`
- When done using the app, press CTRL+C to quit.
- The application is designed around a specific business capability - fetching data from an API and providing real-time and historical stock analysis. This aligns with the principle of single responsibility in microservices architecture.
- The application is designed to be deployed as a containerized application. A Docker Image is available on Dockerhub for the same. This aligns with the principle of infrastructure automation in microservices architecture.
- The application has its own data pipeline, storing data in a Delta Lake in Azure Workspace. This aligns with the principle of decentralized data management in microservices architecture.
- The application is designed to manage a high volume of requests. Load testing was performed using Locust, and the results demonstrated that the application can effectively handle up to 10,000 concurrent users. The maximum response time recorded during this load test was 70,000 milliseconds (or 70 seconds). These results underscore the application's adherence to the principle of elasticity in a microservices architecture, as it can scale to accommodate significant traffic and maintain functionality under heavy load.
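For reference, a Locust load test of the kind described above could be expressed along these lines; this is a minimal sketch, and the endpoint path and form fields are assumptions rather than the exact locustfile used to produce the reported results:

```python
# Minimal Locust sketch; the endpoint path and form fields are assumptions
# and may differ from the locustfile actually used for the reported results.
from locust import HttpUser, task, between

class StockAppUser(HttpUser):
    wait_time = between(1, 3)  # seconds each simulated user waits between tasks

    @task
    def load_landing_page(self):
        self.client.get("/")

    @task
    def request_prediction(self):
        self.client.post("/", data={"ticker": "AAPL", "email": "user@example.com"})
```

Such a test would typically be run with `locust -f locustfile.py --host <app URL>` and ramped up to the desired number of concurrent users from the Locust web UI.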
This application is designed for robust data engineering tasks, leveraging a variety of tools:
- Pandas for Data Analysis: Utilizes Pandas for comprehensive data analysis tasks within a Python environment.
- PySpark for Data Wrangling: Leverages PySpark to efficiently manipulate and transform large-scale datasets.
- Spark SQL for Data Querying and ETL: Employs Spark SQL for streamlined data querying and performing Extract, Transform, Load (ETL) operations on diverse datasets.
- Delta Lake for Data Storage: Utilizes Delta Lake for reliable and scalable data storage, ensuring ACID transactions and versioning for data integrity.
This combination of tools offers a powerful suite for managing, analyzing, and processing data at scale, enabling efficient data engineering workflows.
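As an illustrative (not verbatim) example of how these pieces fit together, the sketch below loads raw prices with PySpark, runs a Spark SQL transformation, hands a result back to Pandas, and writes to a Delta table; the file path, table, and column names are assumptions:

```python
# Illustrative sketch of the tool chain; paths, table names, and columns are
# assumptions, not the repository's actual pipeline code.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("stock-etl-sketch")
    .getOrCreate()  # assumes Delta Lake support is available (e.g. on Databricks)
)

# PySpark for data wrangling: load raw historical prices (hypothetical path/schema).
raw = spark.read.csv("data/historical_prices.csv", header=True, inferSchema=True)
raw.createOrReplaceTempView("prices")

# Spark SQL for querying/ETL: average closing price per ticker and date.
daily_avg = spark.sql(
    "SELECT ticker, date, AVG(close) AS avg_close FROM prices GROUP BY ticker, date"
)

# Pandas for analysis: hand a (small) result set back to Pandas if needed.
summary = daily_avg.toPandas()

# Delta Lake for storage: ACID writes with versioning (time travel).
daily_avg.write.format("delta").mode("overwrite").save("/delta/stock_daily_avg")
```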
Infrastructure as Code (IaC) is a practice in which the infrastructure setup is written in code files, rather than manually configured. These code files can be version-controlled and reviewed, allowing for easy changes and rapid disaster recovery.
Here's how this project satisfies the Infrastructure as Code requirement:
Dockerization:
We have containerized the application using Docker. The Dockerfile serves as a form of Infrastructure as Code, as it automates the process of setting up the application environment.
Hosting on Azure ACR:
We have used Azure Resource Manager (ARM) templates to automate the deployment of our Docker containers to Azure Container Registry (ACR); this is also a form of Infrastructure as Code.
Data Pipeline Setup:
The setup of the data pipeline (from the API to Delta Lake in Azure Workspaces) is automated using Azure Workflows that are scheduled to run daily; this is another example of Infrastructure as Code.
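As a rough sketch of what the scheduled daily ingestion step in such a pipeline might look like (the API URL, record schema, and Delta path below are placeholders, not the actual workflow code):

```python
# Rough sketch of a daily ingestion job; the API URL, schema, and Delta path
# are placeholders, not the workflow actually scheduled in Azure/Databricks.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-ingest-sketch").getOrCreate()

def ingest_daily_quotes(api_url: str, delta_path: str) -> None:
    records = requests.get(api_url, timeout=30).json()  # fetch the latest quotes
    df = spark.createDataFrame(records)                 # list of dicts -> Spark DataFrame
    df.write.format("delta").mode("append").save(delta_path)  # append today's rows

if __name__ == "__main__":
    ingest_daily_quotes("https://example.com/api/quotes", "/delta/stock_quotes")
```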
- The load test results are available in the results folder. The system performance showed that the median response time increased after ~18,000 users.
Files in this repository include:
The README.md file is a markdown file that contains basic information about the repository, what files it contains, and how to consume them.
The requirements.txt file has a list of packages to be installed for running the project.
This folder contains all the code files used in this repository; the files named "Test_" are used for testing and the remaining files define certain functions.
This folder contains the HTML templates which will be used by the Flask Application.
- stock_prediction.html - an HTML file containing the landing page view for the app
- email_sent.html - an HTML file containing the view shown after the email is sent to the user
The Makefile contains instructions for installing packages (specified in requirements.txt), formatting the code (using black), testing the code (running all the sample Python code files starting with the term 'Check...'), and linting the code using pylint.
GitHub Actions is used to automate the following processes whenever a change is made to the files in the repository:
- install: installs the packages and libraries mentioned in requirements.txt
- test: uses pytest to test the Python scripts
- format: uses black to format the Python files
- lint: uses ruff to lint the Python files
Note: if all the processes run successfully, the following output will be visible in GitHub Actions:
The .devcontainer folder mainly contains two files:
- Dockerfile: defines the development environment; essentially it ensures that all collaborators using the repository work in the same environment to avoid conflicts and version mismatch issues.
- devcontainer.json: a JSON file that specifies the environment configuration, including the extensions installed in the virtual environment.
Contains CSV and Parquet files used for initial loads and fast access.
Contains additional files which are used in the README.
This is the Dockerfile which contains instructions for building the Docker image; it is for the app and is different from the Dockerfile present in the .devcontainer folder.
Divya Sharma (ds655)
Revanth Chowdary Ganga (rg361)
Udyan Sachdev (us26)
Ayush Gupta (ag758)
Teamwork Reflection: Please find the teamwork reflection in the teamwork folder in this repository.