Textual Inversion for Stable Diffusion Finetuning

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Introduction

Textual Inversion is a method to teach a pretrained text-to-image model to generate images of a specific, unique concept, modify its appearance, or compose it in new roles and novel scenes. It requires little training data: only 3-5 images of a user-provided concept. It does not update the weights of the text-to-image model to learn the new concept; instead, the concept is learned as new "words" in the embedding space of a frozen text encoder. These new "words" can be composed into natural-language sentences, making it possible to generate personalized visual content.

Figure 1. The Diagram of Textual Inversion. [1]

As shown above, the textual inversion method consists of the following steps:

  1. First, it creates a text prompt following a template such as A photo of $S_{*}$, where $S_{*}$ is a placeholder for the new "word" to be learned.
  2. Then, the tokenizer assigns a unique index to this placeholder. This index corresponds to a single embedding vector $v_{*}$ in the embedding lookup table; $v_{*}$ is trainable, while all other parameters are frozen.
  3. Lastly, it computes the loss of the generator (e.g., the stable diffusion model) and the gradient with respect to the single embedding vector, then updates only the weights in $v_{*}$ at each training step (see the sketch after this list).
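
To make these steps concrete, here is a minimal sketch of the tokenizer side of the procedure, using the CLIPTokenizer from transformers that this example already depends on. The embedding-table handling is only outlined in comments, since the exact calls depend on the text-encoder implementation; this is an illustration, not the repository's actual code.

# A minimal sketch of steps 1-3 (illustration only, not this repository's code).
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

placeholder_token = "<cat-toy>"   # the new "word" S_*
initializer_token = "toy"         # an existing word used to initialize v_*

# Steps 1-2: register the placeholder as a new token and look up its unique index.
tokenizer.add_tokens([placeholder_token])
placeholder_id = tokenizer.convert_tokens_to_ids(placeholder_token)
initializer_id = tokenizer.convert_tokens_to_ids(initializer_token)
prompt_ids = tokenizer("A photo of <cat-toy>").input_ids

# Step 2 (embedding side, outlined only): grow the text encoder's embedding table
# by one row and copy the initializer token's embedding into the new row v_*:
#     embedding_table[placeholder_id] = embedding_table[initializer_id]
#
# Step 3 (outlined only): freeze all model parameters except the embedding table,
# and mask its gradient so that only the row at placeholder_id is ever updated.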

Preparation

Dependency

Make sure the following frameworks are installed.

  • mindspore 2.1.0 (Ascend 910) / mindspore 2.2.1 (Ascend 910*)

Enter the example/stable_diffusion_v2 folder and run

pip install -r requirements.txt

Pretrained Models

Pretrained models will be automatically downloaded when launching the training script. See _version_cfg in train_textual_inversion.py for more information.

Since we use CLIPTokenizer from the transformers library, please also install this library using:

pip install "transformers>=4.16.0"

In addition, please prepare the tokenizer openai/clip-vit-large-patch14 from the Hugging Face website so that CLIPTokenizer can load it. The directory tree of the openai/ folder should be:

openai/
└── clip-vit-large-patch14
    ├── config.json
    ├── merges.txt
    ├── preprocessor_config.json
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json
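
If the machine has network access to the Hugging Face Hub, one convenient way to fetch exactly these files is via huggingface_hub (an optional shortcut, not part of this repository's scripts):

# Optional shortcut: download only the tokenizer/config files listed above into
# a local openai/ folder (requires network access to the Hugging Face Hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/clip-vit-large-patch14",
    local_dir="openai/clip-vit-large-patch14",
    allow_patterns=["*.json", "*.txt"],
)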

Finetuning Dataset Preparation

Depending on the concept that we want the finetuned model to learn, the datasets fall into two groups: datasets of the same object and datasets of the same style.

For the object dataset, we use the cat-toy dataset, which contains six images, shown below.

Figure 2. The cat-toy example dataset for finetuning.

For the style dataset, we use the test set of the chinese-art dataset, which contains 20 images. Some example images are shown below:

Figure 3. The example images from the test set of the chinese-art dataset

For details on downloading the chinese-art dataset, please refer to LoRA: Text-image Dataset Preparation.

All finetuning images of a dataset should be placed under the same folder, like this:

dir-to-images
├── img1.jpg
├── img2.jpg
├── img3.jpg
├── img4.jpg
└── img5.jpg

We name the folder containing the object dataset datasets/cat_toy, and the folder containing the test set of the chinese-art dataset datasets/chinese_art.
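
Optionally, a quick sanity check that every file in the finetuning folder is a readable image (a small illustrative helper, not part of the training scripts):

# Illustrative helper: verify that every file in the finetuning folder is a
# readable image before launching training.
import os
from PIL import Image

data_dir = "datasets/cat_toy"   # or "datasets/chinese_art"
for name in sorted(os.listdir(data_dir)):
    with Image.open(os.path.join(data_dir, name)) as img:
        print(f"{name}: size={img.size}, mode={img.mode}")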

Finetuning

The key arguments for finetuning experiments are explained as follows:

  • num_vectors: the number of trainable text embeddings for the text encoder. A larger value means a larger capacity. We recommend running a grid search to find the optimal number of vectors for each training experiment.
  • start_learning_rate: the initial learning rate for the linear-decay learning-rate scheduler.
  • max_steps: the maximum number of training steps, which overrides the number of training epochs (--epochs).
  • gradient_accumulation_steps: the number of gradient accumulation steps. The default value is 4.
  • placeholder_token: the token $S_{*}$.
  • initializer_token: the token used to initialize the newly added text embedding.
  • learnable_property: one of ["object", "style"].
  • scale_lr: whether to scale the learning rate based on the batch size, the gradient accumulation steps, and the number of devices (cards); see the illustration after this list.
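
For scale_lr, a commonly used scaling rule is sketched below; the exact behavior is defined in train_textual_inversion.py, so treat this only as an illustration.

# Illustration of a common learning-rate scaling rule when scale_lr is enabled;
# check train_textual_inversion.py for the exact rule used in this repository.
def scaled_lr(start_learning_rate: float,
              train_batch_size: int,
              gradient_accumulation_steps: int,
              num_devices: int) -> float:
    return start_learning_rate * train_batch_size * gradient_accumulation_steps * num_devices

# e.g., 1e-4 with batch size 1, the default 4 accumulation steps, and 1 card:
print(scaled_lr(1e-4, 1, 4, 1))   # 4e-4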

In the following tutorial, we will use SDv1.5 as an example. The hyperparameter configuration file for SDv1.5 is configs/train/train_config_textual_inversion_v1.yaml.

We also support finetuning SDv2.0 (2.1) with configs/train/train_config_textual_inversion_v2.yaml. Compared with the hyperparameters for SDv1.5, we use a smaller learning rate (x0.5) and train for more steps (x2.0) for SDv2.0, since SDv2.0 is more prone to overfitting.

Object Dataset Experiment

The optimal hyperparameters for the cat-toy dataset with SDv1.5 are num_vectors=3, start_learning_rate=1e-4, and max_steps=3000. See configs/train/train_config_textual_inversion_v1.yaml for more information.

The standalone training command for SDv1.5 finetuning on the cat-toy dataset:

python train_textual_inversion.py \
    --train_config configs/train/train_config_textual_inversion_v1.yaml  \
    --output_path="output/"

Suppose the saved checkpoint file is output/weights/SDv1.5_textual_inversion_3000_ti.ckpt. We can then use the following command to run inference with the newly learned text embedding:

python text_to_image.py  \
    --version "1.5" \
    --prompt "a <cat-toy> backpack" \
    --config configs/v1-train-textual-inversion.yaml  \
    --ti_ckpt_path output/weights/SDv1.5_textual_inversion_3000_ti.ckpt \
    --output_path vis/

The generated images using the prompt "a <cat-toy> backpack" are show below:

Figure 4. The generated images using the textual inversion weight.

Style Dataset Experiment

For the chinese-art dataset, we first need to change the key arguments in configs/train/train_config_textual_inversion_v1.yaml as follows:

train_data_dir: "datasets/chinese_art"
placeholder_token: "<chinese-art>"
initializer_token: "art"
learnable_property: "style"
num_vectors: 1
...
max_steps: 4000

Then, we can use the following command for training:

python train_textual_inversion.py  \
    --train_config configs/train/train_config_textual_inversion_v1.yaml  \
    --output_path="output/chinese_art"

After training, suppose the saved checkpoint file is output/chinese_art/weights/SDv1.5_textual_inversion_4000_ti.ckpt. We can then use the following command to run inference with the newly learned text embedding:

python text_to_image.py  \
    --version "1.5" \
    --prompt "a dog in <chinese-art> style" \
    --config configs/v1-train-textual-inversion.yaml  \
    --ti_ckpt_path output/chinese_art/weights/SDv1.5_textual_inversion_4000_ti.ckpt \
    --output_path vis/

The generated images using the prompt "a dog in <chinese-art> style" are show below:

Figure 5. The generated images using the textual inversion weight.

References

[1] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, Daniel Cohen-Or: An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. ICLR 2023