unCLIP is the approach behind OpenAI's DALL·E 2, trained to invert CLIP image embeddings. This method finetunes SD to accept a CLIP ViT-L/14 image embedding in addition to the text encodings. This means that the model can be used to produce image variations with or without text input.
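To make the conditioning concrete, here is a minimal conceptual sketch of the two signals an unCLIP-style model consumes. This is not the repository's actual interface; the key names and shapes are illustrative assumptions only.

```python
# Conceptual sketch only: the two conditioning signals of an unCLIP-style SD model.
# Key names and shapes are assumptions, not this repository's actual API.
import numpy as np

clip_image_emb = np.random.randn(1, 768).astype(np.float32)   # CLIP ViT-L/14 image embedding
text_emb = np.random.randn(1, 77, 1024).astype(np.float32)    # SD 2.x text-encoder output (may encode an empty prompt)

conditioning = {
    "c_crossattn": text_emb,     # usual text conditioning; an empty prompt yields pure image variations
    "c_adm": clip_image_emb,     # extra image-embedding conditioning introduced by unCLIP
}
print({k: v.shape for k, v in conditioning.items()})
```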
Please make sure the following frameworks are installed.
- mindspore >= 1.9 (2.0 is recommended for the best performance)
- python >= 3.7
- openmpi 4.0.3 (for distributed training/evaluation)
Install the dependent packages by running:
```shell
pip install -r requirements.txt
```
Please download the pretrained SD2.0-v-pred checkpoint and the SD-unclip-l checkpoint, put them under the `models/` folder, and run

```shell
python tools/model_conversion/unclip/prepare_unclip_train.py
```

to combine the parameters from the two checkpoints into a single one. The combined checkpoint is saved as `models/sd_v2_v_embedder.ckpt`.
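For reference, merging two MindSpore checkpoints is conceptually straightforward. The following is a minimal sketch and not a substitute for `prepare_unclip_train.py` (which handles the parameter naming for this model); the input checkpoint filenames are placeholders.

```python
# A minimal sketch of merging two MindSpore checkpoints into one.
# The input filenames below are placeholders; use the checkpoints you downloaded.
import mindspore as ms

sd_params = ms.load_checkpoint("models/sd_v2_vpred.ckpt")        # placeholder: SD2.0-v-pred checkpoint
embedder_params = ms.load_checkpoint("models/sd_unclip_l.ckpt")  # placeholder: SD-unclip-l checkpoint

merged = dict(sd_params)
merged.update(embedder_params)  # on duplicate names, keep the unCLIP embedder's parameters

ms.save_checkpoint(
    [{"name": name, "data": param} for name, param in merged.items()],
    "models/sd_v2_v_embedder.ckpt",
)
```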
The text-image pair dataset for finetuning should follow the file structure below:

```text
dir
├── img1.jpg
├── img2.jpg
├── img3.jpg
└── img_txt.csv
```
`img_txt.csv` is the annotation file in the following format:

```text
dir,text
img1.jpg,a cartoon character with a potted plant on his head
img2.jpg,a drawing of a green pokemon with red eyes
img3.jpg,a red and white ball with an angry look on its face
```
For convenience, we have prepared a public text-image dataset in the above format.

- pokemon-blip-caption dataset, containing 833 pokemon-style images with BLIP-generated captions.

To use it, please download `pokemon_blip.zip` from the OpenI dataset website, then unzip it into a local directory, e.g. `./datasets/pokemon_blip`.
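As a sanity check, you can read the annotation file with the Python standard library. The sketch below only assumes the folder layout shown above; the dataset path is an example.

```python
# Read img_txt.csv and pair each caption with the full image path.
# Assumes the folder layout shown above; data_dir is an example path.
import csv
import os

data_dir = "./datasets/pokemon_blip/train"
pairs = []
with open(os.path.join(data_dir, "img_txt.csv"), newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        pairs.append((os.path.join(data_dir, row["dir"]), row["text"]))

print(f"{len(pairs)} image-text pairs; first entry: {pairs[0]}")
```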
We will use the `train_unclip_image_variation.py` script to finetune the unCLIP image-variation model. Before running it, please modify the following arguments to your local paths, either in the shell or in the config file `train_config_vanilla_v2_vpred_unclip_l.yaml`:
```text
--data_path=/path/to/data
--output_path=/path/to/save/output_data
--pretrained_model_path=/path/to/pretrained_model
```
Then, execute the script to launch finetuning:
```shell
mpirun -n 4 python train_unclip_image_variation.py \
    --train_config "configs/train/train_config_vanilla_v2_vpred_unclip_l.yaml" \
    --data_path "datasets/pokemon_blip/train" \
    --pretrained_model_path "models/sd_v2_v_embedder.ckpt" \
    --output_path unclip-train
```
Note: to modify other important hyper-parameters, please refer to the training config file `train_config_vanilla_v2_vpred_unclip_l.yaml`.

After training, the checkpoint will be saved to `output/unclip-train/ckpt/sd-600.ckpt` by default.
Below are some arguments that you may want to tune for better performance on your dataset:

- `train_batch_size`: the batch size for training.
- `start_learning_rate` and `end_learning_rate`: the initial and final learning rates for training.
- `epochs`: the number of training epochs.
- `use_ema`: whether to use EMA for model weight smoothing.
For more details on the arguments, please run `python train_unclip_image_variation.py -h`.
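If you prefer editing the config file rather than passing flags, below is a hedged sketch of adjusting these options programmatically. It assumes PyYAML is available and that the options appear as top-level keys in the YAML with the same names as the command-line arguments; the values are example settings only.

```python
# Sketch: adjust a few training hyper-parameters in the YAML config.
# Assumes PyYAML is installed and that the keys below exist at the top level
# of the config with these exact names; values are examples only.
import yaml

cfg_path = "configs/train/train_config_vanilla_v2_vpred_unclip_l.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["train_batch_size"] = 2
cfg["start_learning_rate"] = 1e-5
cfg["end_learning_rate"] = 1e-7
cfg["epochs"] = 20
cfg["use_ema"] = True

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```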
To perform image-to-image generation with the finetuned checkpoint, please prepare a test image and run:

```shell
python unclip_image_variation.py \
    --config configs/v2-vpred-inference-unclip-l.yaml \
    --ckpt_path path_of_the_finetune_checkpoint \
    --image_path path_of_the_test_image \
    --prompt "a picture of a unicorn with orange hair"
```
Here are the example results.
| Prompt | Image Input | Image Output 1 | Image Output 2 |
|---|---|---|---|
| "a blue jellyfish with red eyes and a red nose" | | | |
| "" | | | |
| "a picture of a unicorn with orange hair" | | | |
| "" | | | |
| "a cartoon picture of a blue and white pokemon" | | | |
| "" | | | |