- Introduction
- Installation
- Dataset Preparation
- Text-to-Image
- Image-to-Image
- ControlNet
- T2I Adapter
- Advanced Usage
This repository integrates state-of-the-art Stable Diffusion models including SD1.5, SD2.0, and SD2.1, supporting various generation tasks and pipelines. Efficient training and fast inference are implemented based on MindSpore.
New models and features will be continuously updated.
SD Model | Text-to-Image | Image Variation | Inpainting | Depth-to-Image | ControlNet | T2I Adapter |
---|---|---|---|---|---|---|
1.5 | Inference, Training | N.A. | N.A. | N.A. | Inference, Training | Inference |
2.0 & 2.1 | Inference, Training | Inference, Training | Inference | Inference | N.A. | Inference, Training |
wukong | Inference, Training | N.A. | Inference | N.A. | N.A. | N.A. |
Although some combinations are not supported currently (due to the lack of checkpoints pretrained on the specific task and SD model), you can use the Model Conversion tool to convert a checkpoint (e.g., from HF) and then adapt it to an existing pipeline (e.g., the image variation pipeline with SD 1.5).
You may click the link in the table to access the running instructions directly.
For model performance, please refer to benchmark.
Our code is mainly developed and tested on Ascend 910 platforms with MindSpore framework. The compatible framework versions that are well-tested are listed as follows.
Ascend | MindSpore | CANN | driver | Python | MindONE |
---|---|---|---|---|---|
910 | 2.0 | 6.3 RC1 | 23.0.rc1 | 3.7.16 | master (4c33849) |
910 | 2.1 | 6.3 RC2 | 23.0.rc2 | 3.9.18 | master (4c33849) |
910* | 2.2.1 (20231124) | 7.1 | 23.0.rc3.6 | 3.7.16 | master (4c33849) |
For detailed instructions to install CANN and MindSpore, please refer to the official webpage MindSpore Installation.
Note: Running on other platforms (such as GPUs) and MindSpore versions may not be reliable. It's highly recommended to use the verified CANN and MindSpore versions. More compatible versions will be continuously updated.
git clone https://github.com/mindspore-lab/mindone.git
cd mindone/examples/stable_diffusion_v2
pip install -r requirements.txt
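As an optional sanity check (not part of the original instructions), you can confirm that the MindSpore installation is visible to Python before running any example:

python -c "import mindspore; print(mindspore.__version__)"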
This section describes the data format and protocol for diffusion model training.
The text-image pair dataset should be organized as follows.
data_path
├── img1.jpg
├── img2.jpg
├── img3.jpg
└── img_txt.csv
where `img_txt.csv` is the image-caption file annotated in the following format:
dir,text
img1.jpg,a cartoon character with a potted plant on his head
img2.jpg,a drawing of a green pokemon with red eyes
img3.jpg,a red and white ball with an angry look on its face
The first column is the image path relative to `data_path`, and the second column is the corresponding prompt.
For convenience, we have prepared two public text-image datasets obeying the above format.
- pokemon-blip-caption dataset, containing 833 pokemon-style images with BLIP-generated captions.
- Chinese-art blip caption dataset, containing 100 Chinese art-style images with BLIP-generated captions.
To use them, please download `pokemon_blip.zip` and `chinese_art_blip.zip` from the openi dataset website. Then unzip them to your local directory, e.g., `./datasets/pokemon_blip`.
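For instance, a minimal sketch of preparing the Pokemon dataset locally (the target paths are examples only, and the folder layout inside the zip may differ; adjust to your setup):

# assumes pokemon_blip.zip has already been downloaded to the current directory
mkdir -p datasets
unzip pokemon_blip.zip -d datasets/
ls datasets/pokemon_blip   # expect the images plus img_txt.csv (and train/test splits if provided)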
To generate images by providing a text prompt, please download one of the following checkpoints and put it in the `models` folder:
SD Version | Lang. | MindSpore Checkpoint | Ref. Official Model | Resolution |
---|---|---|---|---|
1.5 | EN | sd_v1.5-d0ab7146.ckpt | stable-diffusion-v1-5 | 512x512 |
1.5-wukong | CN | wukong-huahua-ms.ckpt | N.A. | 512x512 |
2.0 | EN | sd_v2_base-57526ee4.ckpt | stable-diffusion-2-base | 512x512 |
2.0-v | EN | sd_v2_768_v-e12e3a9b.ckpt | stable-diffusion-2 | 768x768 |
2.1 | EN | sd_v2-1_base-7c8d09ce.ckpt | stable-diffusion-2-1-base | 512x512 |
2.1-v | EN | sd_v2-1_768_v-061732d1.ckpt | stable-diffusion-2-1 | 768x768 |
Take SD 1.5 for example:
cd examples/stable_diffusion_v2
wget https://download.mindspore.cn/toolkits/mindone/stable_diffusion/sd_v1.5-d0ab7146.ckpt -P models
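If you want the SD 2.0 base checkpoint instead, and assuming it is hosted under the same download prefix as the SD 1.5 file above (an assumption; verify against the checkpoint link in the table), the analogous command would be:

wget https://download.mindspore.cn/toolkits/mindone/stable_diffusion/sd_v2_base-57526ee4.ckpt -P models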
After preparing the pretrained weight, you can run text-to-image generation by:
python text_to_image.py --prompt {text prompt} -v {model version}
`-v`: model version. Valid values are listed in the `SD Version` column of the table above.
For more argument details, please run `python text_to_image.py -h`.
Take SD 1.5 as an example:
# Generate images with the provided prompt using SD 1.5
python text_to_image.py --prompt "elven forest" -v 1.5
Take SD 2.0 as an example:
# Use SD 2.0 instead and add negative prompt guidance to eliminate artifacts
python text_to_image.py --prompt "elven forest" -v 2.0 --negative_prompt "moss" --scale 9.0 --seed 42
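As a further illustration, you can keep the seed fixed and vary the guidance scale to see how strongly the prompt is enforced; this only reuses the `--scale` and `--seed` arguments shown above:

# lower guidance scale: looser adherence to the prompt
python text_to_image.py --prompt "elven forest" -v 2.0 --seed 42 --scale 5.0
# higher guidance scale: stronger adherence to the prompt
python text_to_image.py --prompt "elven forest" -v 2.0 --seed 42 --scale 12.0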
By default, inference uses the DPM++ 2M sampler. You can use other samplers if needed; the supported list and detailed illustrations can be found in schedulers.
For parallel inference, take SD1.5 on the Chinese art dataset as an example:
mpirun --allow-run-as-root -n 2 python text_to_image.py \
--config "configs/v1-inference.yaml" \
--data_path "datasets/chinese_art_blip/test/prompts.txt" \
--output_path "output/chinese_art_inference/txt2img" \
--ckpt_path "models/sd_v1.5-d0ab7146.ckpt" \
--use_parallel True
Note: Parallel inference can only be used with multiple-prompt inputs.
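For reference, the multiple prompts are provided through a plain text file such as the `prompts.txt` used above. A minimal sketch, assuming one prompt per line (the exact file format should be checked against the dataset files in the repository):

printf '%s\n' \
    "a traditional Chinese ink painting of mountains and a river" \
    "a Chinese painting of plum blossoms in winter" \
    > my_prompts.txt
# then pass --data_path my_prompts.txt to the mpirun command above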
Long Prompts Support
By default, SD v2 (and v1.5) only supports token sequences no longer than 77 tokens; longer sequences are truncated to 77, which can cause information loss.
To avoid information loss for long text prompts, we can divide one long token sequence (N>77) into several shorter sub-sequences (N<=77) to bypass the context-length constraint of the text encoder. This feature is enabled by `args.support_long_prompts` in `text_to_image.py`.
When running inference with `text_to_image.py`, you can set the arguments as below.
python text_to_image.py \
... \ # other arguments configurations
--support_long_prompts True \ # allow long text prompts
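For example, a complete invocation might look like the following (a sketch, assuming the SD 1.5 checkpoint prepared earlier; the prompt is only illustrative):

python text_to_image.py \
    -v 1.5 \
    --support_long_prompts True \
    --prompt "a highly detailed matte painting of an elven forest at dawn, towering ancient trees covered in moss, shafts of golden light breaking through the canopy, a winding river, scattered ruins overgrown with ivy, mist drifting between the trunks, intricate foliage, epic scale"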
Flash-Attention Support
MindONE supports flash attention by setting the argument `enable_flash_attention` to `True` in `configs/v1-inference.yaml` or `configs/v2-inference.yaml`. For example, in `configs/v1-inference.yaml`:
unet_config:
target: ldm.modules.diffusionmodules.openaimodel.UNetModel
params:
...
enable_flash_attention: False
fa_max_head_dim: 256 # max head dim of flash attention. In case of oom, reduce it to 128
One can set `enable_flash_attention` to `True`. In case of an OOM (out of memory) error, please reduce `fa_max_head_dim` to 128.
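If you prefer to flip the switch from the command line, a simple optional edit with `sed` works, assuming the flag is currently written exactly as `enable_flash_attention: False` as shown above:

# enable flash attention in the v1 inference config
sed -i 's/enable_flash_attention: False/enable_flash_attention: True/' configs/v1-inference.yaml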
Here are some generation results.
Prompt: "elven forest" With negative prompt: "moss"
Vanilla fine-tuning refers to training the whole UNet while freezing the CLIP-TextEncoder and VAE modules in the SD model.
To run vanilla fine-tuning, we will use the `train_text_to_image.py` script, following the instructions below.
- Prepare the pretrained checkpoint, referring to pretrained weights.
- Prepare the training dataset, referring to Dataset Preparation.
- Select a training configuration template from `configs/train` and specify the `--train_config` argument. The selected config file should match the pretrained weight.
  - For SD1.5, use `configs/train/train_config_vanilla_v1.yaml`
  - For SD2.0 or SD2.1, use `configs/train/train_config_vanilla_v2.yaml`
  - For SD2.x with v-prediction, use `configs/train/train_config_vanilla_v2_vpred.yaml`

  Note that the model architecture (defined via `model_config`) and training recipes are preset in the yaml file. You may edit the file to adjust hyper-parameters like learning rate, training epochs, and batch size for your task.
- Launch the training script after specifying the `data_path`, `pretrained_model_path`, and `train_config` arguments.

  python train_text_to_image.py \
      --train_config {path to pre-defined training config yaml} \
      --data_path {path to training data directory} \
      --output_path {path to output directory} \
      --pretrained_model_path {path to pretrained checkpoint file}

  Please enable INFNAN mode by `export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"` for Ascend 910* if overflow is found.

  Take fine-tuning SD1.5 on the Pokemon dataset as an example:

  python train_text_to_image.py \
      --train_config "configs/train/train_config_vanilla_v1.yaml" \
      --data_path "datasets/pokemon_blip/train" \
      --output_path "output/finetune_pokemon/txt2img" \
      --pretrained_model_path "models/sd_v1.5-d0ab7146.ckpt"
The trained checkpoints will be saved in {output_path}.
For more argument details, please run `python train_text_to_image.py -h`.
For parallel training on multiple Ascend NPUs, please refer to the instructions below.
- Generate the rank table file for the target Ascend server.

  python tools/hccl_tools/hccl_tools.py --device_num="[0,8)"

  `--device_num` specifies which cards to train on, e.g. "[4,8)". A json file, e.g. `hccl_8p_10234567_127.0.0.1.json`, will be generated in the current directory after running.
- Edit the distributed training script `scripts/run_train_distributed.sh` to specify `rank_table_file` with the path to the rank table file generated in step 1, and `data_path`, `pretrained_model_path`, and `train_config` according to your task.
- Launch the distributed training script by

  bash scripts/run_train_distributed.sh

  Please enable INFNAN mode by `export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"` for Ascend 910* if overflow is found.

  After launching, the training process can be traced by running `tail -f ouputs/train_txt2img/rank_0/train.log`. The trained checkpoints will be saved in `ouputs/train_txt2img`.
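For orientation, here is a sketch of the kind of values one fills in when editing `scripts/run_train_distributed.sh`; the variable names and assignment format below are assumptions for illustration only, so follow whatever the actual script defines:

# hypothetical assignments inside scripts/run_train_distributed.sh
rank_table_file=hccl_8p_10234567_127.0.0.1.json
data_path=datasets/pokemon_blip/train
pretrained_model_path=models/sd_v1.5-d0ab7146.ckpt
train_config=configs/train/train_config_vanilla_v1.yaml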
Note: For distributed training on large-scale datasets such as LAION, please refer to LAION Dataset Preparation.
Low-Rank Adaptation (LoRA) is a parameter-efficient finetuning method for large models.
Please refer to the tutorial of LoRA for Stable Diffusion Finetuning for detailed instructions.
DreamBooth allows users to generate contextualized images of one subject using just 3-5 images of the subject, e.g., your dog.
Please refer to the tutorial of DreamBooth for Stable Diffusion Finetuning for detailed instructions.
Textual Inversion learns one or a few text embedding vectors for a new concept, e.g., object or style, with only 3~5 images.
Please refer to the tutorial of Textual Inversion for Stable Diffusion Finetuning for detailed instructions.
This pipeline uses a fine-tuned version of Stable Diffusion 2.1, which can be used to create image variations (image-to-image).
The pipeline comes with two pre-trained models, `2.1-unclip-l` and `2.1-unclip-h`, which use the pretrained CLIP image embedder and OpenCLIP image embedder, respectively.
You can use the `-v` argument to decide which model to use.
The amount of image variation can be controlled by the noise injected into the image embedding, which is set via the `--noise_level` argument. A value of 0 means no noise, while a value of 1000 means full noise.
To generate variant images by providing a source image, please download one of the following checkpoints and put it in the `models` folder:
SD Version | Lang. | MindSpore Checkpoint | Ref. Official Model | Resolution |
---|---|---|---|---|
2.1-unclip-l | EN | sd21-unclip-l-baa7c8b5.ckpt | stable-diffusion-2-1-unclip | 768x768 |
2.1-unclip-h | EN | sd21-unclip-h-6a73eca5.ckpt | stable-diffusion-2-1-unclip | 768x768 |
Also download the image encoder checkpoint ViT-L-14_stats-b668e2ca.ckpt to the `models` folder.
After preparing the pretrained weights, you can run image variation generation by:
python unclip_image_variation.py \
-v {model version} \
--image_path {path to input image} \
--prompt "your magic prompt to run image variation."
`-v`: model version. Valid values are listed in the `SD Version` column of the table above.
For more argument usage, please run `python unclip_image_variation.py --help`.
Using the `2.1-unclip-l` model as an example, you may generate variant images based on the example image by:
python unclip_image_variation.py \
-v 2.1-unclip-l \
--image_path tarsila_do_amaral.png \
--prompt "a cute cat sitting in the garden"
The output images will be saved in the `output/samples` directory.
You can also add extra noise to the image embedding to increase the amount of variation in the generated images.
python unclip_image_variation.py -v 2.1-unclip-l --image_path tarsila_do_amaral.png --prompt "a cute cat sitting in the garden" --noise_level 200
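To get a feel for how `--noise_level` trades off fidelity against variation, you could sweep a few values in a loop; this is only an illustrative sketch reusing the documented arguments above:

for nl in 0 100 200 500; do
    python unclip_image_variation.py -v 2.1-unclip-l --image_path tarsila_do_amaral.png \
        --prompt "a cute cat sitting in the garden" --noise_level $nl
done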
For image-to-image fine-tuning, please refer to the tutorial of Stable Diffusion unCLIP Finetuning for detailed instructions.
Text-guided image inpainting allows users to edit specific regions of an image by providing a mask and a text prompt, which enables an interesting erase-and-replace editing operation. When the prompt is empty, it can auto-fill the masked regions to fit the image context (similar to the AI fill and extend operations in Photoshop beta).
To perform inpainting on an input image, please download one of the following checkpoints and put it in the `models` folder:
SD Version | Lang. | MindSpore Checkpoint | Ref. Official Model | Resolution |
---|---|---|---|---|
2.0-inpaint | EN | sd_v2_inpaint-f694d5cf.ckpt | stable-diffusion-2-inpainting | 512x512 |
1.5-wukong-inpaint | CN | wukong-huahua-inpaint-ms.ckpt | N.A. | 512x512 |
After preparing the pretrained weight, you can run image inpainting by:
python inpaint.py \
-v {model version} \
--image {path to input image} \
--mask {path to mask image} \
--prompt "your magic prompt to paint the masked region"
`-v`: model version. Valid values are listed in the `SD Version` column of the table above.
For more argument usage, please run `python inpaint.py --help`.
Using `2.0-inpaint` as an example, you can download the example image and mask. Then execute:
python inpaint.py \
-v 2.0-inpaint \
--image overture-creations-5sI6fQgYIuo.png \
--mask overture-creations-5sI6fQgYIuo_mask.png \
--prompt "Face of a yellow cat, high resolution, sitting on a park bench"
The output images will be saved in the `output/samples` directory. Here are some generated results.
Text-guided image inpainting. From left to right: input image, mask, generated images.
By setting an empty prompt (`--prompt=""`), the masked part will be auto-filled to fit the context and background.
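For example, reusing the sample image and mask from above with only the prompt left empty (same documented flags as before):

python inpaint.py \
    -v 2.0-inpaint \
    --image overture-creations-5sI6fQgYIuo.png \
    --mask overture-creations-5sI6fQgYIuo_mask.png \
    --prompt ""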
Image inpainting. From left to right: input image, mask, generated images
This pipeline allows you to generate new images conditioning on a depth map (preserving image structure) and a text prompt. If you pass an initial image instead of a depth map, the pipeline will automatically extract the depth from it (using Midas depth estimation model) and generate new images conditioning on the image depth, the image, and the text prompt.
SD Version | Lang. | MindSpore Checkpoint | Ref. Official Model | Resolution |
---|---|---|---|---|
2.0 | EN | sd_v2_depth-186e18a0.ckpt | stable-diffusion-2-depth | 512x512 |
And download the depth estimation checkpoint midas_v3_dpt_large-c8fd1049.ckpt to the `models/depth_estimator` directory.
After preparing the pretrained weight, you can run depth-to-image by:
# depth to image given a depth map and text prompt
python depth_to_image.py \
--prompt {text prompt} \
--depth_map {path to depth map}
In case you don't have a depth map, you can input a source image instead. The pipeline will extract the depth map from the source image.
# depth to image conditioning on an input image and text prompt
python depth_to_image.py \
--prompt {text prompt} \
--image {path to initial image} \
--strength 0.7
`--strength` indicates how strongly the pipeline transforms the initial image. A lower value preserves more content of the input image; a value of 1 ignores the initial image and conditions only on the depth map and text prompt.
The output images will be saved in the `output/samples` directory.
Example:
Download the two-cat image and save it in the current folder. Then execute
python depth_to_image.py --image 000000039769.jpg --prompt "two tigers" --negative_prompt "bad, deformed, ugly, bad anatomy"
Here are some generated results.
Text-guided depth-to-image. From left to right: input image, estimated depth map, generated images
The two cats are replaced with two tigers while the background and image structure are mostly preserved in the generated images.
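To further explore how `--strength` shifts the balance between the input image and the prompt, you could compare two runs on the same image; this is an illustrative sketch reusing only the flags documented above:

# keep most of the original image content
python depth_to_image.py --image 000000039769.jpg --prompt "two tigers" --strength 0.4
# rely almost entirely on the depth structure and the prompt
python depth_to_image.py --image 000000039769.jpg --prompt "two tigers" --strength 1.0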
ControlNet is a type of model for controllable image generation. It helps make image diffusion models more controllable by conditioning the model with an additional input image. Stable Diffusion can be augmented with ControlNets to enable conditional inputs like canny edge maps, segmentation maps, keypoints, etc.
For detailed instructions on inference and training with ControlNet, please refer to Stable Diffusion with ControlNet.
T2I-Adapter is a simple and lightweight network that provides extra visual guidance for Stable Diffusion models without re-training them. The adapters act as plug-ins to SD models, making them easy to integrate and use.
For detailed instructions on inference and training with T2I-Adapters, please refer to T2I-Adapter.
We provide tools to convert SD 1.x or SD 2.x model weights from torch to MindSpore format. Please refer to this doc.
Currently, we support the following diffusion schedulers.
- DDIM
- DPM Solver
- DPM Solver++
- PLMS
- UniPC
Detailed illustrations and comparison of these schedulers can be viewed in Diffusion Process Schedulers.
You can also run inference with other existing pre-trained models derived from SD, which have undergone extensive fine-tuning or were trained from scratch on specific datasets. Convert the weights from torch to MindSpore format first, then run inference with the supported samplers.
Here we provide an example of running inference on the Deliberate Model. Please refer to the instructions here, Inference with the Deliberate Model.
The default objective function in SD training is to minimize the noise prediction error (noise-prediction). To alter the objective to v-prediction, which is used in SD 2.0-v and SD 2.1-v, please refer to v-prediction.md
We provide different evaluation methods including FID and CLIP-score to evaluate the quality of the generated images. For detailed usage, please refer to Evaluation for Diffusion Models
Coming soon
Coming soon
Please refer to Frequently Asked Questions.
- 2024.01.10
- Add Textual Inversion fine-tuning
- 2023.12.01
- Add ControlNet v1
- Add unclip image variation pipeline, supporting both inference and training.
- Add image inpainting pipeline
- Add depth-to-image pipeline
- Fix bugs and improve compatibility to support more Ascend chip types
- Refactor documents
- 2023.08.30
- Add T2I-Adapter support for text-guided Image-to-Image translation.
- 2023.08.24
- Add Stable Diffusion v2.1 and v2.1-v (768)
- Support checkpoint auto-download
- 2023.08.17
- Add Stable Diffusion v1.5
- Add DreamBooth fine-tuning
- Add text-guided image inpainting
- Add CLIP score metrics (CLIP-I, CLIP-T) for evaluating visual and textual fidelity
- 2023.07.05
- Add negative prompts
- Improve logger
- Fix bugs for MS 2.0.
- 2023.06.30
- Add LoRA fine-tuning and FID evaluation.
- 2023.06.12
- Add velocity parameterization for DDPM prediction type. Usage: set `parameterization: velocity` in `configs/your_train.yaml`
We appreciate all kinds of contributions, including making issues or pull requests to make our work better.