In this project we fine-tune a diffusion model on images of Pokémon. The images are annotated with text labels. The goal is a deployable model that generates Pokémon from a text prompt.
Everyone contributed equally and fairly throughout the whole project! 🙌🙌🙌
Our project is now open source. If you want to contribute, please follow the instructions below. Have fun coding!
Starting Point Alarm! 🚨 [Back to Top]
Before you git add anything related to this repo, please make sure you run the following commands!
# Get the newest version of the repo!
git pull origin main
# install the newest version dependencies!
pip install -r real_requirements.txt
# run the pre-commit hook to check/modify your file you wanna push!
pip install pre-commit
# Alert!!!💥 The following line will check every file in the repo based on the pre-commit hook!
pre-commit run --all-files
# Only want to check one file? Use this command instead!
pre-commit run --files YOUR_FILE_NAME
# Then do the normal procedure 💯
# git add / git commit / git push ...
Please always open a pull request if you want to merge your changes into the repo! 🤗
- Model Training
- Model Preparation
- Cloud Training commands
- Deploy Model Via FastAPI
- Serve Model Locally
- Deploy model via Google Cloud
- Data drifting Check
- Pytorch Lightning Training, Profiling, DDP and Distributed Data Loading
- Model Pruning&Compiling&Quantization
- Run model training locally
- Run model training in a docker container
- Workspace cleaning and garbage collection
Model Training 🌋 [Back to Top]
TL;DR: I just want to train my model! 🤘
Finetune a Stable Diffusion 🔥 [Back to Top]
To finetune a Stable Diffusion Model simply run the following commands:
# Get the repo!
git clone https://github.com/MikeySaw/pokemon_generation/
cd pokemon_generation/
# Get the env!
conda create -n pokemon python==3.11
conda activate pokemon
pip install -r real_requirements.txt
# Get the data and the origin model weights!
dvc pull
# Train the model!
python pokemon_stable_diffusion/sd_finetune.py
To train a DDPM model from scratch (the backbone of our SD), simply run the following commands.
cd src/modeling/
python train_ddpm_example.py
Alert!!!🚨 You need a powerful GPU to run the training commands! (Stable Diffusion fine-tuning needs a GPU with at least 18 GB of memory.)
Test Stable Diffusion Model with a dummy input [Back to Top]
To test Stable Diffusion Model with a dummy input (already prepared for you!), simply run the following commands:
python pokemon_stable_diffusion/latent_diffusion.py
This will run the dummy training process based on a dummy image and a dummy txt file.
You will see the generated image sample_0.png if the code is executed correctly.
Alert!🚨 You need a server with at least 24 GB of RAM to test this code!
You may encounter issues when you install requirements.txt via the command line. These are likely caused by the following lines inside the requirements.txt file:
-e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
-e git+https://github.com/openai/CLIP.git@main#egg=clip
You may need to manually install those two packages if the issues persist.
Data Download Part 🚚 [Back to Top]
Please run pip install -r requirements.txt to install all the dependencies for now; we will use an environment.yml file later.
You need a kaggle.json file to activate the kaggle package and its related commands, for example kaggle --version.
Run the following commands on the command line to download the zipped images from the Kaggle website and unzip them:
chmod +x get_images.sh
bash get_images.sh IMAGE_FOLDER.zip DESTINATION_FOLDER
Data Version Control ⚙️ [Back to Top]
Run the following commands to test whether dvc works with your environment. Please pin your dvc version to 3.50.1 so that we all use the same version. This avoids version conflicts during the Docker image build phase. We also use Google Cloud Storage as our remote data storage. To set this up, simply run the following commands:
# Ignore the first line if you have not installed dvc yet
pip uninstall dvc
pip install dvc==3.50.1
pip install dvc-gs
# test if the dvc is working on your PC/System
dvc pull
Reproduce Dataset creation 🖼️ [Back to Top]
If you want to create a dataset with your own images, run the following commands. They generate captions for your images, move the images and the created jsonl files to their respective train/test/val folders, and create a dataset for you. Make sure your images are in the data/raw directory:
# generate captions with BLIP2
python src/data/add_data_description.py
# create train/test/val split
python src/data/create_data_splits.py
# create a torch dataset for train/test/val split
python src/data/make_dataset.py
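As a rough illustration of the split step, the sketch below shuffles a list of image file names with a fixed seed and cuts it into train/val/test parts. The 80/10/10 ratios, the seed and the function name split_files are assumptions made for this example; src/data/create_data_splits.py may work differently.

```python
import random


def split_files(files, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle file names reproducibly and split them into train/val/test.

    The fractions and the seed are illustrative assumptions, not the values
    used by src/data/create_data_splits.py.
    """
    shuffled = sorted(files)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever is left goes to test
    }


if __name__ == "__main__":
    names = [f"pokemon_{i:03d}.png" for i in range(20)]
    splits = split_files(names)
    print({name: len(items) for name, items in splits.items()})
```

Using a fixed seed keeps the split stable between runs, which matters when the resulting folders are tracked with dvc.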
Hydra Test 👾 [Back to Top]
Please check the config/ folder for the different hyperparameter settings. To add your own experiment hyperparameters, simply add another yaml file inside the config/ folder. Please be aware of the required format of the hyperparameter yaml files: you need to add this line
# @package _global_
at the beginning of your yaml files, so that we can later switch the config file directly from the command line like this:
# change the default hyperparameter values to the values inside the train_1.yaml file
python train.py config=train_1.yaml
The structure of this folder should always look similar to this one:
├── config
│   ├── default_config.yaml
│   └── experiments
│       ├── train_1.yaml
│       └── train_2.yaml
We can also override config settings for training/sampling from the command line, for example:
python train.py optimizer=sgd
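To make the override syntax more concrete, here is a minimal sketch of how key=value arguments can be merged into a nested config. This is not Hydra's implementation, only a standard-library illustration of the idea; the default values and the helper name apply_overrides are assumptions.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with `key=value` overrides applied.

    Dotted keys such as `trainer.epochs=5` address nested dictionaries.
    Values are kept as strings here; Hydra itself also parses types.
    """
    result = copy.deepcopy(config)  # never mutate the defaults
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        node = result
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # walk or create nested sections
        node[leaf] = value
    return result


if __name__ == "__main__":
    defaults = {"optimizer": "adam", "trainer": {"epochs": "10"}}  # hypothetical defaults
    print(apply_overrides(defaults, ["optimizer=sgd", "trainer.epochs=5"]))
```

Copying the defaults before applying overrides mirrors the behaviour you want from a config system: the yaml files stay the single source of truth, and each run only changes what is passed on the command line.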
Github Actions & Continuous Integration & Docker Build Workflow 🐝 [Back to Top]
For GitHub Actions related files, please check the .github/workflows folder. It includes all the GitHub Actions that are triggered when we push/pull into our repo. Here is a brief introduction to what those files do:
The ci.yaml file is responsible for continuous integration. Triggering this workflow runs the tests folder and all the pytest files inside this repo.
The lint.yaml file is responsible for the pre-commit hook, which checks the formatting of all files inside this repo.
When we pull/merge to the GitHub repo, Google Cloud automatically triggers the Docker image build workflow: the cloudbuild.yaml file builds a Docker image that tests the dvc pull command for getting the data.
Pre-Commit Hook 🕵️ [Back to Top]
To check the detailed configuration of the pre-commit hook, please check the .pre-commit-config.yaml file. If you are not satisfied with the style we are using, simply change the settings inside this file!
Pytest Test ✔️ [Back to Top]
To run the .py files related to the pytest package, simply run the following command:
pytest tests/
This will run all the files inside the tests folder that are named test_...
Wanna add your own pytest check to the repo? Easy! Simply add a .py file inside the tests folder. The file should be named test_..., and the functions inside it should be written like this:
def test_...(*args, **kwargs):
    ...
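For instance, a test file could look like the sketch below. The helper clean_prompt and the file name tests/test_prompt.py are hypothetical; they only illustrate the test_ naming convention that pytest relies on for discovery.

```python
# tests/test_prompt.py (hypothetical file name)


def clean_prompt(prompt: str) -> str:
    """Toy helper under test: collapse whitespace and lowercase the prompt."""
    return " ".join(prompt.split()).lower()


def test_clean_prompt_collapses_whitespace():
    assert clean_prompt("  A   cute  Pikachu ") == "a cute pikachu"


def test_clean_prompt_handles_empty_string():
    assert clean_prompt("") == ""
```

In a real test the helper would be imported from the package instead of being defined in the test file.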
Coverage ⌛ [Back to Top]
To calculate the coverage rate of all the pytest related tests, simply run the following commands:
coverage run -m pytest tests/
# check the coverage report!
coverage report
Dockerfile Test 🐋 [Back to Top]
Please read test_trainer.dockerfile for more details. This file is a showcase for building everything, i.e. dvc, CUDA and ENTRYPOINT, in one dockerfile.
To make this dockerfile easier to understand, a toy example is added at src/model/train_example.py; this is the entrypoint of the dockerfile.
To build and test this toy example dockerfile, simply run the following commands:
# build dockerfile
sudo docker build -f test_trainer.dockerfile . -t test_trainer:latest
# test dockerfile
sudo docker run --gpus all -e WANDB_API_KEY=YOUR_WANDB_KEY test_trainer:latest
Make sure to replace YOUR_WANDB_KEY here with your real wandb personal token!
Dockerfile Building Up commands 🐳 [Back to Top]
To build the training dockerfile, please run the following commands:
# If you encounter issues, consider using `sudo` before the whole command
docker build -f sd_finetune.dockerfile . -t fd_train:latest
To override the entrypoint of the dockerfile, simply run the following command:
docker run --gpus all -e WANDB_API_KEY=YOUR_WANDB_TOKEN --entrypoint python fd_train:latest pokemon_stable_diffusion/sd_finetune.py
For Mac M1/M2 chip users: consider using this command if you want to deploy the model on the cloud later:
docker build --platform linux/amd64 -f sd_finetune.dockerfile . -t fd_train:latest
To build the data test dockerfile and check whether dvc is working correctly, simply run the following command:
# If you encounter issues, consider using `sudo` before the whole command
docker build -f dvcdata.dockerfile . -t fd_data:latest
To build upon app.py and deploy your lovely model on Google Cloud later, simply run the following command:
docker build -f gcloudrun.dockerfile . -t gcp_test_app:latest
To run the training image you just built, simply run the following command:
Alert! 🚨The following dockerfile includes GPU training support, automatic dvc data preparation, and Wandb logging. Please make sure you have the whole environment prepared!
Alert! 🚨Stable Diffusion fine-tuning needs a GPU with at least 18 GB of memory. Use a server or consider renting a GPU if you want to run the following dockerfile.
docker run --gpus all -e WANDB_API_KEY=YOUR_WANDB_KEY fd_train:latest
Please replace YOUR_WANDB_KEY with your own wandb authorization token. To get your token, simply click the following link: wandb authorization link, then log in and copy your authorization token.
Please do not forget the --gpus all flag; it will automatically🪄activate your NVIDIA GPU if your machine has one. Enjoy the fast training! 🏄♀️
Before you build another (large!) Docker image, consider checking which images you already have:
docker images
If you find out that you accidentally built an image you do not need anymore, run the following command to delete it:
docker rmi IMAGE_ID
If you encounter issues when deleting images, copy the sequence of numbers at the end of your error message, then try the following two commands:
docker rm numbers
# or
docker rmi numbers
# then try to delete the docker images again
docker rmi IMAGE_ID
If the --gpus all flag returns an error about GPU support, check the following:
# check if the nvidia-driver is installed
# go to their website and download the driver if you do not have one already
nvidia-smi
# check if the compiler is correct/cuda toolkit is available
nvcc --version
# you may need sudo rights if nvcc command is not recognized by your machine
# sudo apt install nvidia-cuda-toolkit
If the commands above did not solve the error you are encountering, you may need an extra toolkit for your container to run with GPU support:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
After running those commands, your dockerfile should now work with GPU support very smoothly!🏎️
Cloud Training commands ☁️ [Back to Top]
To start cloud training on GCloud Compute Engine with Nvidia GPU support, first run the following command to check the available GPUs in the different zones:
gcloud compute accelerator-types list
Since we are not going to train the whole model on GCloud Compute Engine, we do not need anything more advanced than an Nvidia T4. It is also really hard and expensive to get any GPU besides the T4. Run the following command to see whether we can successfully create a compute instance with GPU support:
gcloud compute instances create adios1 \
--zone="asia-northeast3-c" \
--image-family="pytorch-latest-gpu" \
--image-project=deeplearning-platform-release \
--accelerator="type=nvidia-tesla-t4,count=1" \
--maintenance-policy TERMINATE
Once you have successfully created an instance, ssh to the instance to launch your training.
# check the compute instances we created already
gcloud compute instances list
# ssh to the one with GPU support
gcloud beta compute ssh <instance-name>
If there are not enough compute resources for Compute Engine, you will receive an error message like this:
message: The zone 'projects/PROJ_ID/zones/ZONE' does not
have enough resources available to fulfill the request. Try a different zone, or
try again later.
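If one zone is out of capacity, a simple approach is to try the same command in other zones. The sketch below only builds the gcloud argument lists, reusing the flags from the create command above. The list of fallback zones and the helper names are assumptions, and you would still need to pass each command to subprocess.run yourself and stop at the first one that succeeds.

```python
def build_create_command(name, zone):
    """Build the `gcloud compute instances create` call for one zone."""
    return [
        "gcloud", "compute", "instances", "create", name,
        f"--zone={zone}",
        "--image-family=pytorch-latest-gpu",
        "--image-project=deeplearning-platform-release",
        "--accelerator=type=nvidia-tesla-t4,count=1",
        "--maintenance-policy", "TERMINATE",
    ]


# Candidate zones are illustrative; check `gcloud compute accelerator-types list`
# to see where a T4 is actually offered.
CANDIDATE_ZONES = ["asia-northeast3-c", "us-central1-a", "europe-west4-a"]


def commands_for_zones(name, zones=CANDIDATE_ZONES):
    """Return one create command per zone, to be tried in order."""
    return [build_create_command(name, zone) for zone in zones]


if __name__ == "__main__":
    for cmd in commands_for_zones("adios1"):
        print(" ".join(cmd))
```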
Luckily, we got a GPU from asia-northeast3-c, so let's ssh to the server and have fun there!
To ssh to the server, simply run the following command:
gcloud compute ssh --zone "asia-northeast3-c" "adios1" --project "lovely-aurora-423308-i7"
Next, since we are going to train our model on Google Cloud, please run the following commands to check the pre-defined Docker images and the environment on the instance:
# check all the deep learning related pre-defined docker images
gcloud container images list --repository="gcr.io/deeplearning-platform-release"
# check the lovely pytorch with GPU support!
python -c "import torch; print(torch.__version__)"
# check the lovely nvidia-driver we have!
nvidia-smi
Now we have everything prepared. From here on, this is exactly the same as deploying a model on our own server: simply follow the Model Training section in this README.md file. Happy coding!😊
If no compute resources are available at the moment, we have to use Vertex AI instead.
We define our training config in job_config.yaml, build and push the training Docker image to the Artifact Registry, and then create the custom job:
gcloud ai custom-jobs create \
--region=us-central1 \
--display-name=pokemon-training-job \
--config=job_config.yaml
Deploy Model Via FastAPI 🧑💻 [Back to Top]
Wanna see an image which should be a pokemon but does not look like a pokemon at all? 👀 Simply run the following commands!
# Deploy the model locally via FastAPI!
python app.py
You will see in the terminal that our application is up and running!
To generate an image from your prompt, open this link in your browser: http://localhost:8080/docs, click the try it out button, then replace the str with a real prompt. It will generate a pokemon image for you!
Feeling angry that the generated images do not look like a pokemon? 😡 Try the finetuned version! Simply run the following commands to deploy a fine-tuned Stable Diffusion model locally for your lovely pokemon!
# Deploy a fine-tuned model!
python finetune_app.py
Simply do the same thing as before, then download the generated image, have fun with this pokemon app!🐻
If you want to check the monitoring of the deployed application, simply go to this link:
http://localhost:8080/metrics
Serve Model Locally 👩💻 [Back to Top]
To serve our latent diffusion model locally, simply run the following commands!
torch-model-archiver --model-name latent_diffusion \
--version 1.0 \
--model-file pokemon_stable_diffusion/latent_diffusion.py \
--handler latent_diffusion_handler.py \
--extra-files "conf/ddpm_config.yaml,sd-v1-4-full-ema.ckpt" \
--requirements-file real_requirements.txt
Now we have a latent_diffusion.mar file, which can be served with the torchserve package. Run the following command to make it work! 🈺
torchserve --start --ncs --model-store localserve --models latent_diffusion.mar --ts-config config.properties
We also offer a one-step solution for using this torchserve model. Simply run this file and have fun!
python torchserverun.py
Deploy model via Google Cloud🧨 [Back to Top]
To deploy a function via cloud function, please follow these steps:
- Go to the cloud function page and click the create function button.
- For Authentication, choose Allow unauthenticated invocations.
- In the section Runtime, build, connections and security settings, change these three options: Memory Allocated, CPU and Timeout. Then click next.
- Change the runtime to python 3.X and update the content of requirements.txt and main.py.
- Click the test function button. If there is no error, click the deploy button at the left corner to finish the deployment!
To run your deployed model on Cloud Function via the command line, simply run the following command:
curl -X POST -F "file=@/path/to/your/image.jpg" https://REGION-PROJECT_ID.cloudfunctions.net/predict
# In our case, this command would be:
# curl -X POST -F "file=@IMAGE_PATH" https://us-central1-lovely-aurora-423308-i7.cloudfunctions.net/predict
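The endpoint URL follows the REGION-PROJECT_ID pattern from the curl command. A small helper like the following makes that pattern explicit and helps avoid typos when switching projects. The function names are assumptions made for this sketch, and the code only assembles the command without sending a request.

```python
def cloud_function_url(region, project_id, function_name="predict"):
    """Compose the HTTPS endpoint of a Cloud Function from its parts."""
    return f"https://{region}-{project_id}.cloudfunctions.net/{function_name}"


def curl_command(image_path, url):
    """Return the curl invocation as an argument list (suitable for subprocess.run)."""
    return ["curl", "-X", "POST", "-F", f"file=@{image_path}", url]


if __name__ == "__main__":
    url = cloud_function_url("us-central1", "lovely-aurora-423308-i7")
    print(" ".join(curl_command("/path/to/your/image.jpg", url)))
```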
If you encounter an issue while using the cloud function commands here, simply run the following command to preprocess the input images first:
# Your images should be saved into a folder
python file_pre.py YOUR_IMAGE_FOLDER_PATH/ YOUR_IMAGE_SAVING_PATH
This makes the images fulfill the requirements for prediction with the deployed cloud function.
To deploy your trained model with its trained weights on Google Cloud, you need an Artifact Registry and must enable the Google Cloud Run service via the command line or the Cloud Console.
Run the following command to enable the Cloud Run service via command line:
gcloud services enable run.googleapis.com
You can actually do everything from the command line without going to the Cloud Console; the command line is all you need!💯 To create an Artifact Registry repository and use it for cloud deployment, simply run:
gcloud artifacts repositories create CUSTOM_NAME --repository-format=docker --location=LOCATION --description="DESCRIPTION"
You need to authenticate before you build and push your cloud deployment image:
gcloud auth login
gcloud auth configure-docker
gcloud auth configure-docker LOCATION.docker.pkg.dev
# make sure you are in the correct project
gcloud config set project YOUR_PROJ_ID
# If you haven't built the image you want to deploy yet, run the following command:
docker build -f gcloudrun.dockerfile . -t gcp_test_app:latest
# In our case: docker tag gcp_test_app us-central1-docker.pkg.dev/lovely-aurora-423308-i7/gcf-artifacts/gcp_test_app
docker tag gcp_test_app LOCATION-docker.pkg.dev/YOUR_PROJ_ID/CUSTOM_NAME/gcp_test_app:latest
# To push the docker image to your Artifact Registry, run this command
# In our case: docker push us-central1-docker.pkg.dev/lovely-aurora-423308-i7/gcf-artifacts/gcp_test_app
docker push LOCATION-docker.pkg.dev/YOUR_PROJ_ID/CUSTOM_NAME/gcp_test_app:latest
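The image name used for docker tag and docker push always has the same shape. As a sketch, the helper below assembles it from its parts and returns the two docker commands as argument lists. The helper names are assumptions; the format itself is the LOCATION-docker.pkg.dev/YOUR_PROJ_ID/CUSTOM_NAME/IMAGE:TAG pattern used in the commands above.

```python
def artifact_image_uri(location, project_id, repository, image, tag="latest"):
    """Compose a Docker image name that points at an Artifact Registry repository."""
    return f"{location}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


def tag_and_push_commands(local_image, remote_uri):
    """Return the `docker tag` and `docker push` calls as argument lists."""
    return [
        ["docker", "tag", local_image, remote_uri],
        ["docker", "push", remote_uri],
    ]


if __name__ == "__main__":
    uri = artifact_image_uri(
        "us-central1", "lovely-aurora-423308-i7", "gcf-artifacts", "gcp_test_app"
    )
    for cmd in tag_and_push_commands("gcp_test_app", uri):
        print(" ".join(cmd))
```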
After you have successfully pushed your image, run the following command in the terminal to deploy your model on Cloud Run:
gcloud run deploy YOUR_SERVICE_NAME \
--image LOCATION-docker.pkg.dev/YOUR_PROJ_ID/CUSTOM_NAME/gcp_test_app \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 32Gi \
--cpu 8
In our case, this command would be:
gcloud run deploy latent-diffusion-service \
--image us-central1-docker.pkg.dev/lovely-aurora-423308-i7/gcf-artifacts/gcp_test_app \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 32Gi \
--cpu 8
The terminal should then return a message like this:
Deploying container to Cloud Run service [YOUR_SERVICE_NAME] in project [YOUR_PROJ_ID] region [LOCATION]
You can put sudo in front of every command above without running into problems. However, doing so causes severe authorization issues when you push your image to your Artifact Registry. The push will always fail with an error like this:
denied: Permission "artifactregistry.repositories.uploadArtifacts" denied on resource "projects/my-project/locations/LOCATION/repositories/my-repo"
To solve this issue, the following two steps may be needed. Please run both of them, then log out of your PC and log back in to make them take effect.
To avoid needing sudo for anything related to docker, please run the following command:
sudo usermod -aG docker $USER
Please see the following link to find out why we need to do this: Cloud Run Guidance. Specifically, this note explains the core idea: Note: If you normally run Docker commands on Linux with sudo, Docker looks for Artifact Registry credentials in /root/.docker/config.json instead of $HOME/.docker/config.json.
After removing the sudo requirement, go to the Cloud Console, or simply click this link: IAM Role. Find your own email, then add these roles to your account: Artifact Registry Administrator, Artifact Registry Writer. After these two steps you will have no issues pushing images!☘️
Data drifting Check [Back to Top]
To check the model's robustness towards data drift during image generation, simply run the following commands:
python data_drifting
google-chrome image_drift_report.html
Pytorch Lightning Training, Profiling, DDP and Distributed Data Loading 🏎️ [Back to Top]
To train the model using the lightning package, simply run the following command:
python pokemon_stable_diffusion/sd_finetune_pl.py
The lightning package has a profiler parameter inside the Trainer. Simply set it via Trainer(profiler="simple", ...), and the profiling report is returned at the end of training.
To train the model with the DDP strategy, simply set the ddp flag inside the argparse to True. This activates DDP training on 2 GPUs. As for data loading, the num_workers parameter is set to a value larger than 1 in all the files, so data is always loaded by multiple worker processes in parallel.
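The profiler and DDP settings described above map onto keyword arguments of the Lightning Trainer. The sketch below shows one way to derive them from a command-line flag without importing Lightning. The flag name --ddp and the helper trainer_kwargs are assumptions about how sd_finetune_pl.py is organised, not a copy of it.

```python
import argparse


def trainer_kwargs(use_ddp, profiler="simple"):
    """Build keyword arguments intended for lightning's Trainer(...)."""
    kwargs = {"profiler": profiler}
    if use_ddp:
        # DDP across 2 GPUs, matching the description above.
        kwargs.update({"accelerator": "gpu", "devices": 2, "strategy": "ddp"})
    return kwargs


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Assumed flag name; the real script may define it differently.
    parser.add_argument("--ddp", action="store_true", help="enable DDP training")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--ddp"])  # fixed argv so the demo is deterministic
    print(trainer_kwargs(args.ddp))
    # With Lightning installed: trainer = Trainer(**trainer_kwargs(args.ddp))
```

Keeping the keyword arguments in a plain dictionary makes it easy to log the exact Trainer configuration alongside each run.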
Model Pruning&Compiling&Quantization 🪄 [Back to Top]
To get a "smaller" version of the model via pruning, simply run the following command:
python pruning.py
Since PyTorch entered the 2.0 era, you can speed up model training/inference with a single line of code:
model = torch.compile(model)
This works like a free gift and can speed things up by 20 to 30 percent. For a quantization-style speedup via reduced-precision (TF32) matrix multiplication, simply add this trick to your code:
import torch

# run faster: allow TF32 math for cuDNN convolutions and CUDA matmuls
tf32 = True
torch.backends.cudnn.allow_tf32 = bool(tf32)
torch.backends.cuda.matmul.allow_tf32 = bool(tf32)
torch.set_float32_matmul_precision('high' if tf32 else 'highest')
This can speed up your training/inference by up to 50 percent.
Run model training locally [Back to Top]
To run training locally use:
python -u src/modeling/training.py hydra.job.chdir=False
Specifying "hydra.job.chdir=False" is necessary because hydra changes the working directory by default (this is something we do not want).
Run model training in a docker container [Back to Top]
To run the model training script src/modeling/training.py in a reproducible docker container, first build an image using the following command:
docker build -f dockerfiles/OLD_training.dockerfile . -t training:<image_tag>
Then run the training script in a container using:
docker run --gpus all --rm \
-v $(pwd)/data:/wd/data `# mount the data folder` \
-v $(pwd)/models:/wd/models `# mount the model folder` \
-v $(pwd)/conf:/wd/conf `# mount the config file folder` \
-v $(pwd)/hydra_logs/training_outputs:/wd/outputs `# mount the hydra logging folder` \
-v $(pwd)/wandb:/wd/wandb `# mount the wandb outputs folder` \
-v $(pwd)/lightning_logs:/wd/lightning_logs `# mount the lightning outputs folder` \
--name <container_name> \
training:<image_tag> \
paths.model_name=model0 \
paths.training_data=data/processed/pokemon.pth
(The option "hydra.job.chdir=False" is already specified in the image and need not be explicitly added.)
Workspace cleaning and garbage collection [Back to Top]
To remove a docker image run the following:
docker rmi <image_name>:<image_tag>
To run docker garbage collection run the following:
docker system prune -f
To delete all unused images (warning) and run docker garbage collection run the following:
docker system prune -af