A fully GKE-managed sentiment analysis application, with language model training and inference automated via GitHub Actions.
- Clone the repository and `cd` into the cloned directory.
- Create and activate a new virtualenv or miniconda Python environment with Python 3.10 or later, e.g. for miniconda:

  ```bash
  conda create -n sentiment_analysis_ci_cd python=3.10
  conda activate sentiment_analysis_ci_cd
  ```

- Install the package in develop mode via `pip install -e .[tests]`.
- Install the `pre-commit` hooks via `pre-commit install`.
- Optional: run the hooks once on all files via `pre-commit run --all-files`.
- Keep the hooks up to date via `pre-commit autoupdate`.
- To upload a dataset, make sure the data is available locally and a wandb project is created under your account, then run the command below (you can also upload multiple files); a sketch of what the script plausibly does follows the example:

  ```bash
  python sa_app/scripts/wandb_init.py --entity <your_user_name> --project <name of wandb project> --artifact_name <artifact name> --artifact_locations <artifact local path>
  ```

  Example:

  ```bash
  python sa_app/scripts/wandb_init.py --entity bdp_grp2 --project sa-roberta --artifact_name sentiment-dataset --artifact_locations <multiple files separated by whitespace>
  ```
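The script presumably wraps the standard wandb artifact API. The sketch below is an assumption about its shape, not the repo's actual code; only the flag names are taken from the CLI above:

```python
# Minimal sketch of a wandb artifact upload; flag names mirror the CLI above,
# everything else is illustrative rather than this repo's actual wandb_init.py.
import argparse
import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--entity", required=True)
parser.add_argument("--project", required=True)
parser.add_argument("--artifact_name", required=True)
parser.add_argument("--artifact_locations", nargs="+", required=True)
args = parser.parse_args()

run = wandb.init(entity=args.entity, project=args.project, job_type="upload-dataset")
artifact = wandb.Artifact(args.artifact_name, type="dataset")
for path in args.artifact_locations:  # multiple files go into one artifact
    artifact.add_file(path)
run.log_artifact(artifact)
run.finish()
```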
- To manually download a dataset, run the command below with the desired file name:

  ```bash
  wandb artifact get prabhupad26/sa-roberta/sentiment-dataset:latest --root training.1600000.processed.noemoticon.csv
  ```
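If you prefer to fetch the artifact from Python instead of the CLI, the public wandb API supports the same operation; the artifact path is the one used above, while the destination directory is an arbitrary choice:

```python
# Python equivalent of the CLI download above; "./data" is an arbitrary destination.
import wandb

api = wandb.Api()
artifact = api.artifact("prabhupad26/sa-roberta/sentiment-dataset:latest")
local_dir = artifact.download(root="./data")  # returns the download directory
print(f"dataset files downloaded to {local_dir}")
```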
If running for the first time, follow the steps below:

- Download the dataset from here.
- Run `python -m spacy download en_core_web_sm` (presumably used by the preprocessing; see the sketch after this list).
- `cd sa_app/src`
- Run this command from the root dir of the training module (i.e. `sa_train/sa_train_module`):

  ```bash
  python training/train.py --config <path to train_cfg.yml file>
  ```
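The `en_core_web_sm` download above supplies the model spaCy loads for text preprocessing. The snippet below is a generic illustration of that kind of step, not the repo's actual pipeline:

```python
# Illustrative spaCy preprocessing of a tweet; the repo's actual pipeline may differ.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires `python -m spacy download en_core_web_sm`
doc = nlp("the new update is honestly amazing!!!")
# Keep lemmas of alphabetic, non-stopword tokens
tokens = [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]
print(tokens)  # lemmatized, lowercased content words
```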
Run the commands below from the root path of this repo:

```bash
docker build -t <name of the image> .
docker run -p 5000:5000 <name of the image>
```
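The `docker run` above maps the container's Flask port to the host. As a rough sketch of the kind of service behind port 5000 (the `/predict` route and the model checkpoint are assumptions, not this repo's actual API):

```python
# Rough sketch of an inference service on port 5000; the /predict route and the
# checkpoint name are assumptions, not this repo's actual API.
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
# Any RoBERTa sentiment checkpoint would slot in here; this one is public.
classifier = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json(force=True).get("text", "")
    return jsonify(classifier(text)[0])  # e.g. {"label": "positive", "score": 0.99}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)  # bind all interfaces so `-p 5000:5000` works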
For training:

- If changed files match the wildcard `training/*`, the model-training workflow runs.
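GitHub Actions evaluates these path filters itself via the workflow's `paths:` setting; purely as a local illustration of the matching rule (not the repo's actual CI code), Python's `fnmatch` approximates the same glob:

```python
# Local illustration only: GitHub Actions does this matching itself;
# fnmatch approximates the training/* glob rule.
from fnmatch import fnmatch

changed_files = ["training/train.py", "sa_app/src/inference/serve.py"]
if any(fnmatch(path, "training/*") for path in changed_files):
    print("model-training workflow would be triggered")
```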
Workflow (via GitHub Actions) flow diagram:
- Add Starter Code: Initial codebase setup is complete.
- Complete Training Code: Training code for the model is implemented.
- Setup Pre-commit: Pre-commit hooks for code quality are configured.
- Add More Data Preprocessing Steps: Additional data preprocessing steps are in progress.
- Create Three Containers:
  - For Training via Flask API: container setup for training is pending.
  - For Inferencing via Flask API: container setup for inferencing is pending.
  - For Dashboard App Deployment on GCP App Engine: container setup for deploying the dashboard app is pending.
- Complete Inference Code in `sa_inference_module`: Flask API for inference is not yet implemented.
- Version Control on Dataset and Model: Explore MLflow integration for dataset and model versioning.
- CI / CD (DevOps): Continuous Integration (CI) and Continuous Deployment (CD) setup is pending.
- Maintain the container versions in a single file: currently they are updated in multiple files (`.github/workflows` and `kubernetes-manifest`).