arv-anshul/yt-comment-sentiment

YouTube Comment Sentiment

An end-to-end project to predict the sentiment of YouTube video comments using Machine Learning.

Overview

This project builds a sentiment analysis system for YouTube comments, with a FastAPI-based inference endpoint and additional API endpoints that provide insights. Development included experimentation with experiment tracking (using MLflow) and pipeline reproduction (using DVC).

[Figure: project diagram]

Key Features

  • Inference Endpoint: Built with the FastAPI framework to classify the sentiment of comments.
  • Insights Endpoints: Additional APIs that provide analytics on comment sentiment.
  • Experiment Tracking: Uses MLflow to track experiments.
  • Pipeline Reproduction: Uses DVC (Data Version Control) for reproducibility.
  • Text Vectorization: Used TfidfVectorizer for transforming text data into feature vectors.
  • Model Selection: Experimented with various models and selected HistGradientBoostingClassifier as the best-performing classifier.

Experimentation

The experimentation phase focused on optimizing the hyperparameters of the TfidfVectorizer and the HistGradientBoostingClassifier. The screenshot below shows how different hyperparameter combinations affected accuracy:
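
To make "hyperparameter combinations" concrete, here is a small sketch of how such a grid might be enumerated. The parameter names (`ngram_range`, `max_features`, `learning_rate`, `max_iter`) are real scikit-learn parameters, but the `tfidf__` / `clf__` prefixes and every value listed are hypothetical; the search space actually explored in this project is not stated here.

```python
from itertools import product

# Hypothetical search space; the values actually explored are not documented here.
search_space = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_features": [1000, 5000],
    "clf__learning_rate": [0.05, 0.1],
    "clf__max_iter": [100, 200],
}


def iter_param_combinations(space):
    """Yield one dict per combination of the listed parameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


combinations = list(iter_param_combinations(search_space))
```

Each dictionary could then be used to train one model and be logged as one tracked run, which is the kind of per-combination comparison the screenshot summarizes.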

[Screenshot: Experiment Results]

Tech Stack

  • Programming Language: Python
  • Data Handling: Polars
  • Frameworks and Tools: MLflow, DVC, FastAPI
  • Machine Learning Models: scikit-learn, NLTK
  • Project Dev Tools: uv, pre-commit, Ruff, Zed, Loguru
  • Frontend: CSS, HTML5, JavaScript, pnpm, shadcn/ui, Tailwind CSS, Vite, Vue.js

More to do!

  1. Merge the classifier and the vectorizer into a single model, which reduces the complexity of loading them using MLFLOW_RUN_ID in app.py.
  2. After completing the previous step, load the model using the MLFLOW_MODEL_URI environment variable instead of MLFLOW_RUN_ID.
  3. ⚠️ Try to use an MLproject file to run the ML pipeline steps instead of the dvc.yaml file (only if possible).
    • Also investigate the use of DVC here and understand the WHY, WHAT and HOW (part of it).
  4. Understand the clear distinction and interaction between the source code of the ML pipeline and the backend.
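
As a rough illustration of items 1 and 2, the switch from MLFLOW_RUN_ID to MLFLOW_MODEL_URI could be handled by a small resolver like the one below. This is a sketch under assumptions, not code from app.py: the function name `resolve_model_uri` and the `"model"` artifact path are hypothetical. The `runs:/<run_id>/<artifact_path>` form is MLflow's URI scheme for artifacts logged under a run.

```python
import os


def resolve_model_uri(env=None, artifact_path="model"):
    """Return a model URI, preferring MLFLOW_MODEL_URI over MLFLOW_RUN_ID.

    `artifact_path` is a hypothetical default; use whatever path the
    merged model was actually logged under.
    """
    env = os.environ if env is None else env

    # Preferred: a complete URI supplied directly.
    model_uri = env.get("MLFLOW_MODEL_URI")
    if model_uri:
        return model_uri

    # Fallback: build a run-relative URI from the run ID.
    run_id = env.get("MLFLOW_RUN_ID")
    if run_id:
        return f"runs:/{run_id}/{artifact_path}"

    raise RuntimeError("Set MLFLOW_MODEL_URI or MLFLOW_RUN_ID")
```

The returned string could then be passed to an MLflow loader such as `mlflow.sklearn.load_model`. Once the vectorizer and classifier are merged into one logged model, a single URI is enough to load everything the API needs.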

Important

Feel free to explore and contribute!