An end-to-end project to predict the sentiment of YouTube video comments using Machine Learning.
This project builds a sentiment analysis system for YouTube comments, with a FastAPI-based inference endpoint and additional API endpoints that provide insights. Development included experimentation, experiment tracking with MLflow, and pipeline reproduction with DVC.
- Inference Endpoint: Built using the FastAPI framework to classify sentiment of comments.
- Insights Endpoints: Additional APIs to provide analytics around comment sentiments.
- Experiment Tracking: Leveraged MLflow for tracking experiments.
- Pipeline Reproduction: Utilized DVC (Data Version Control) for reproducibility.
- Text Vectorization: Used `TfidfVectorizer` for transforming text data into feature vectors.
- Model Selection: Experimented with various models and selected `HistGradientBoostingClassifier` as the best-performing classifier.
The experimentation phase focused on optimizing hyperparameters for the `TfidfVectorizer` and the `HistGradientBoostingClassifier` model. Below is a screenshot showcasing how different hyperparameter combinations impacted accuracy:

[Screenshot: accuracy for different hyperparameter combinations]
Tech | Stack |
---|---|
Programming Language | |
Data Handling | |
Frameworks and Tools | |
Machine Learning Models | |
Project Dev Tools | |
Frontend | |
- Merge the classifier and vectorizer into a single model to reduce the complexity of loading them via `MLFLOW_RUN_ID` in `app.py`.
- After completing the previous step, load the model using the `MLFLOW_MODEL_URI` environment variable instead of `MLFLOW_RUN_ID`.
- ⚠️ Try to use an `MLproject` file to run the ML pipeline steps instead of the `dvc.yaml` file (only if possible).
  - Also investigate the use of `dvc` here and understand the WHY, WHAT and HOW (part of it).
- Understand the clear distinction and interaction between the source code of the ML pipeline and the backend.
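One possible shape for the switch from `MLFLOW_RUN_ID` to `MLFLOW_MODEL_URI` is sketched below. The helper names `resolve_model_uri` and `load_sentiment_model`, the fallback order, and the `model` artifact path are assumptions for illustration; they are not taken from `app.py`. The `runs:/<run_id>/<artifact_path>` form is MLflow's URI scheme for run artifacts, and `mlflow.pyfunc.load_model` accepts such a URI.

```python
import os


def resolve_model_uri(env=None, artifact_path="model"):
    """Return a model URI, preferring MLFLOW_MODEL_URI over MLFLOW_RUN_ID.

    `artifact_path` is a hypothetical default; use whatever path the
    model was logged under.
    """
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_MODEL_URI")
    if uri:
        return uri
    run_id = env.get("MLFLOW_RUN_ID")
    if run_id:
        # Fallback for older deployments that only set the run id.
        return f"runs:/{run_id}/{artifact_path}"
    raise RuntimeError("Set MLFLOW_MODEL_URI (or MLFLOW_RUN_ID) first.")


def load_sentiment_model(env=None):
    """Load the merged vectorizer + classifier model as one pyfunc model."""
    import mlflow.pyfunc  # imported lazily so the resolver works without MLflow

    return mlflow.pyfunc.load_model(resolve_model_uri(env))


# The resolver can be exercised without a tracking server:
print(resolve_model_uri({"MLFLOW_RUN_ID": "abc123"}))  # runs:/abc123/model
```

Keeping URI resolution separate from loading makes the environment handling easy to check on its own, and the backend then needs only a single call at startup.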
Important
Feel free to explore and contribute!