This is Stable Diffusion built on pre-trained Stable Diffusion v1.5 weights with Self-Attention Guidance (SAG) to enhance the stability of the generated images. It also uses ControlNet, a neural network model, to support additional inputs that control the image generation. Additionally, the model can add artistic features to the generated image by utilizing trained style weights.
This model is built on Hugging Face modules. It uses the Tokenizer, Text Encoder, Variational Autoencoder and UNet models from the Hugging Face libraries; a minimal loading sketch follows the list below.
- Tokenizer - creates tokens from the prompt, with padding to match the required length.
- Text Encoder - gets the token embeddings from the tokens along with the positional embeddings. These are combined and fed to a transformer model to produce the output text embeddings.
- UNet - takes in the noisy latents and predicts the noise residual, which has the same shape as the latents.
- Variational Autoencoder - takes in the latents and decodes them into the image space.
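The sketch below shows how these four modules might be loaded and wired together with the `diffusers` and `transformers` libraries. It is a minimal illustration assuming the `runwayml/stable-diffusion-v1-5` checkpoint, not the repository's actual code; the denoising loop is omitted.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL, UNet2DConditionModel

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint

# The four Hugging Face modules described above.
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

# Tokenizer: tokens padded to the required length.
tokens = tokenizer(["a castle on a hill"], padding="max_length",
                   max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")

# Text Encoder: token + positional embeddings -> output text embeddings.
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]

# UNet: noisy latents -> predicted noise residual of the same latent shape.
latents = torch.randn(1, unet.config.in_channels, 64, 64)
with torch.no_grad():
    noise_pred = unet(latents, timestep=10,
                      encoder_hidden_states=text_embeddings).sample

# VAE: decode latents into image space (a real run denoises the latents first).
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
```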
Self-Attention Guidance (SAG) helps Stable Diffusion improve the generated image. It uses the intermediate self-attention maps to adversarially blur the attended regions and guide the model. The `sag_scale` parameter controls the influence of SAG on the model.
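As an illustration of what this scale controls, the snippet below uses the upstream `diffusers` SAG pipeline, which exposes the same idea through its `sag_scale` argument. This is a sketch of the concept, not this repository's implementation.

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# sag_scale = 0 disables SAG; larger values strengthen the self-attention guidance.
image = pipe("Margot Robbie as wonderwoman", sag_scale=0.75).images[0]
image.save("sag_example.png")
```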
ControlNet conditions the diffusion model on specific user input conditions (like edges or depth). This helps it generate images that follow the desired spatial context. The `canny` and `openpose` ControlNets are supported in this application. A conditional input image, such as an edge map or keypoints, is provided along with the ControlNet model for inference. The `controlnet_cond_scale` parameter controls how faithful the generated images are to the conditional image.
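The sketch below shows the same flow with the public `diffusers` ControlNet API: an edge map is extracted from an input image and passed as the conditional image, and `controlnet_conditioning_scale` plays the role of `controlnet_cond_scale`. The prompt and output file names are placeholders; this is not the repository's own pipeline code.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract the canny edge map that serves as the conditional input image.
image = cv2.imread("control_images/controlimage_1.jpg")
edges = cv2.Canny(image, 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A scale of 1.0 follows the edge structure closely; smaller values follow it loosely.
result = pipe(
    "Margot Robbie as wonderwoman",
    image=edge_map,
    controlnet_conditioning_scale=1.0,
).images[0]
result.save("controlnet_canny.png")
```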
The application is trained on a novel art style via Textual Inversion. In our case, images stylistically related to pop art are trained in order to associate the style with the `<pop-art>` token within the text encoder embedding. Training images and the trained `<pop-art>` style weights are available here. To use the style, add `<pop-art>` in the prompt and enable `style_flag` while running the model.
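Conceptually, the learned embedding is attached to the text encoder under the new token and the token is then used in the prompt. The sketch below shows this with the `diffusers` `load_textual_inversion` helper; the weight file name is a placeholder rather than the repository's actual file, and the repository itself wires the style weights through `style_flag` instead.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Register the learned style embedding under the <pop-art> token.
pipe.load_textual_inversion("learned_embeds.bin", token="<pop-art>")

# The new token can now be used directly inside the prompt.
image = pipe("Margot Robbie as wonderwoman in <pop-art> style").images[0]
image.save("pop_art_style.png")
```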
Stable Diffusion can be run in the following two ways-
Clone repository and change directory-
git clone https://github.com/Shashank-Holla/diffusion-controlnet-sag.git
cd diffusion-controlnet-sag/
Install dependencies-
pip install -r requirements.txt
Run the model-
!python main.py --prompt "Margot Robbie as wonderwoman in <pop-art> style" --seed 3 --batch_size 1 --controlNet_image ./control_images/controlimage_1.jpg --controlNet_type canny --style_flag T --sag_scale 0.75 --controlnet_cond_scale 1.0
This repository is also available as a CLI application. Build files are available in the `dist` folder of this repository. The control image and style weights paths must be absolute. A valid control image is required if a ControlNet model is provided.
Clone repository and change directory-
git clone https://github.com/Shashank-Holla/diffusion-controlnet-sag.git
cd diffusion-controlnet-sag/
Install distribution-
!pip install dist/diffusion-0.0.7-py3-none-any.whl
Run the application `generate`. Provide input as prompted-
/usr/local/bin/generate
Shared here are a few run results obtained by changing various parameters.
These run results are obtained by varying the SAG scale and adding the artistic style.
Below is the control image used. The edge map is fed as the conditioning image for Stable Diffusion.
| Control Image | Extracted features for spatial context |
| --- | --- |
Images with the pop-art style show that the style did not exist in the base Stable Diffusion model and is added through the new weights and the newly added token `<pop-art>`. These images also show how the model performs when the scale of ControlNet conditioning is varied. The second image has a `controlnet_cond_scale` of 1.0 and closely follows the edge structure of the conditioning image.
These images have been generated by passing a keypoint control image.
| Control Image | Extracted features for spatial context |
| --- | --- |