Stable Diffusion unCLIP Finetuning

Introduction

unCLIP is the approach behind OpenAI's DALL·E 2, trained to invert CLIP image embeddings. This method finetunes Stable Diffusion (SD) to accept a CLIP ViT-L/14 image embedding in addition to the text encoding, so the finetuned model can produce image variations with or without text input.
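
Below is a minimal, framework-agnostic sketch of the two conditioning signals the finetuned model consumes. The shapes and variable names are illustrative assumptions, not the repo's implementation.

# Illustrative only: the finetuned UNet is conditioned on the usual text context
# plus a CLIP ViT-L/14 image embedding (768-d). Shapes below are assumptions.
import numpy as np

text_context = np.zeros((1, 77, 1024), dtype=np.float32)  # SD 2.x text-encoder output (assumed shape)
image_embed = np.zeros((1, 768), dtype=np.float32)        # CLIP ViT-L/14 image embedding

# With an empty prompt, the text context carries little information and the image
# embedding alone drives the generated variations.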

Get Started

Preparation

Dependency

Please make sure the following frameworks are installed.

  • mindspore >= 1.9 [install] (2.0 is recommended for the best performance)
  • python >= 3.7
  • openmpi 4.0.3 (for distributed training/evaluation) [install]

Install the dependent packages by running:

pip install -r requirements.txt
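
Optionally, you can verify the MindSpore installation before proceeding; the short check below uses MindSpore's standard run_check utility:

import mindspore as ms

print(ms.__version__)  # should be >= 1.9 (2.0 recommended)
ms.run_check()         # runs a small computation to verify the installation works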

Pretrained Models

Please download the pretrained SD2.0-v-pred checkpoint and the SD-unclip-l checkpoint, put them under the models/ folder, and run

python tools/model_conversion/unclip/prepare_unclip_train.py

to combine the parameters from the two checkpoints into a single one. The combined checkpoint is saved as models/sd_v2_v_embedder.ckpt.
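
Conceptually, the conversion script loads the parameter dictionaries of both checkpoints and writes a merged one. The sketch below illustrates this with MindSpore's checkpoint utilities; the filenames and the "embedder" name filter are assumptions for illustration, and the authoritative logic lives in prepare_unclip_train.py.

# Illustrative sketch of merging two checkpoints (not the actual script logic).
import mindspore as ms

sd_params = ms.load_checkpoint("models/sd_v2_vpred.ckpt")      # hypothetical filename
unclip_params = ms.load_checkpoint("models/sd_unclip_l.ckpt")  # hypothetical filename

merged = dict(sd_params)
# take the image-embedder weights from the unCLIP checkpoint (name prefix is an assumption)
merged.update({k: v for k, v in unclip_params.items() if "embedder" in k})

ms.save_checkpoint([{"name": k, "data": v} for k, v in merged.items()],
                   "models/sd_v2_v_embedder.ckpt")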

Text-image Dataset Preparation

The text-image pair dataset for finetuning should follow the file structure below:

dir
├── img1.jpg
├── img2.jpg
├── img3.jpg
└── img_txt.csv

img_txt.csv is the annotation file in the following format:

dir,text
img1.jpg,a cartoon character with a potted plant on his head
img2.jpg,a drawing of a green pokemon with red eyes
img3.jpg,a red and white ball with an angry look on its face
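
For your own data, the annotation file can be generated with Python's csv module. This is a minimal sketch assuming you already have (filename, caption) pairs:

import csv

pairs = [
    ("img1.jpg", "a cartoon character with a potted plant on his head"),
    ("img2.jpg", "a drawing of a green pokemon with red eyes"),
    ("img3.jpg", "a red and white ball with an angry look on its face"),
]

with open("dir/img_txt.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["dir", "text"])  # header expected by the training script
    writer.writerows(pairs)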

For convenience, we have prepared a public text-image dataset in the above format.

To use it, please download pokemon_blip.zip from the openi dataset website, then unzip it to a local directory, e.g. ./datasets/pokemon_blip.

unCLIP Finetune

We will use the train_unclip_image_variation.py script to finetune the unCLIP image variation model. Before running it, please set the following arguments to your local paths, either on the command line or in the config file train_config_vanilla_v2_vpred_unclip_l.yaml (a scripted alternative is sketched after this list):

  • --data_path=/path/to/data
  • --output_path=/path/to/save/output_data
  • --pretrained_model_path=/path/to/pretrained_model
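
If you prefer to update the config file programmatically instead of editing it by hand, a sketch like the following can be used. It requires PyYAML, and the key names simply mirror the command-line flags above; they are assumptions about the YAML schema.

import yaml

cfg_path = "configs/train/train_config_vanilla_v2_vpred_unclip_l.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# assumed keys, matching the command-line flags
cfg["data_path"] = "/path/to/data"
cfg["output_path"] = "/path/to/save/output_data"
cfg["pretrained_model_path"] = "/path/to/pretrained_model"

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)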

Then, execute the script to launch finetuning:

mpirun -n 4 python train_unclip_image_variation.py \
    --train_config "configs/train/train_config_vanilla_v2_vpred_unclip_l.yaml" \
    --data_path "datasets/pokemon_blip/train" \
    --pretrained_model_path "models/sd_v2_v_embedder.ckpt" \
    --output_path unclip-train

Note: to modify other important hyper-parameters, please refer to the training config file train_config_vanilla_v2_vpred_unclip_l.yaml.

After training, the checkpoint will be saved in output/unclip-train/ckpt/sd-600.ckpt by default.

Below are some arguments that you may want to tune for better performance on your dataset:

  • train_batch_size: the batch size for training.
  • start_learning_rate and end_learning_rate: the initial and end learning rates for training.
  • epochs: the number of epochs for training.
  • use_ema: whether to use EMA (exponential moving average) for model weight smoothing (see the sketch below).
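
For reference, EMA keeps a smoothed copy of the model weights that is updated after every training step and is typically used for evaluation and checkpointing. The repo has its own implementation; the snippet below is only a conceptual sketch.

decay = 0.9999

def ema_update(ema_weight, weight):
    # blend the current weight into its moving average after each optimizer step
    return decay * ema_weight + (1.0 - decay) * weight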

For the full list of arguments and their descriptions, please run python train_unclip_image_variation.py -h.

Inference

To perform image-to-image generation with the finetuned checkpoint, please prepare a test image and run

python unclip_image_variation.py \
    --config configs/v2-vpred-inference-unclip-l.yaml \
    --ckpt_path path_of_the_finetune_checkpoint \
    --image_path path_of_the_test_image \
    --prompt "a picture of a unicorn with orange hair"

Here are the example results.

Each row of the results table shows the prompt, the input image, and two output images (the image files are not reproduced here). The example prompts are:

  • "a blue jellyfish with red eyes and a red nose"
  • "a picture of a unicorn with orange hair"
  • "a cartoon picture of a blue and white pokemon"