Trained for 2 epochs on ~570k "illustrations" using a couple CogView ideas #287

afiaka87 · 2021-06-06T20:24:28Z

Perhaps the most important finding here: all of this was done in less than a single day on my RTX 2070 with a mere 8 GiB of VRAM, thanks almost entirely to DeepSpeed's gradient accumulation and automatic mixed precision.

This is largely a reposting of an issue I made: #266 (comment)

I've done a run using --loss_img_weight 1 and setting the presently hidden stable parameter to True in the DALLE initialization.

Here is a W&B report. I'm not tracking text and image loss separately, but the average loss seems to converge much more quickly; I assume that has something to do with the weighting exploring a "different loss curve". Happy to be corrected.

https://wandb.ai/dalle-pytorch-replicate/illustrations_imagenetvqgan/reports/Snapshot-Jun-6-2021-12-43pm--Vmlldzo3NTYxMjE?accessToken=hhov3b0wsf56tts63wx4qijkl4pnpiogizoh6a32bdctvngy5rvwtygjqpfyl1uj

@lucidrains @robvanvolt @rom1504 @gabriel_syme @janEbert @mehdidc

Here is the byte pair encoding I used. It has a vocab size of 8192, covering 99.999% of all unique characters in about 6 million captions from Conceptual Captions. That is perhaps overkill for these illustrations, which have a more limited vocabulary. Created with youtokentome.

https://www.dropbox.com/s/ay01p8zegfwof8t/variety.bpe

Here is a checkpoint from the most recent iteration (still training). Decided to name the checkpoint "royalty free" as the dataset largely consists of 570,000 royalty free illustrations from the conceptual captions dataset.

Transformer dim: 512
Depth: 16
Heads: 16
Head dimension: 64
Gradient clipping normalization factor: 1.0
Batch size: 64 (minibatches of 8 with gradients accumulated 8 times)
Epochs: 2
Max sequence length for the text tokens: 64 (very few captions in the dataset exceed 64 tokens, although some were truncated)
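
To illustrate the batch size line above, here is a minimal pure-Python sketch of why 8 accumulated minibatches of 8 behave like one batch of 64. It is not DeepSpeed code: the one-parameter model, the data, and the `grad_mse` helper are all hypothetical, chosen only so the example runs without a GPU.

```python
# Toy illustration of gradient accumulation (pure Python, no DeepSpeed).
# Model: y_hat = w * x, loss = mean squared error. Everything here is a
# hypothetical stand-in for the real transformer and dataset.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

MICRO_BATCH = 8   # samples per forward/backward pass (from the post)
ACCUM_STEPS = 8   # minibatches accumulated per optimizer step (from the post)
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS  # 64

# Made-up data, for illustration only.
xs = [float(i % 7 + 1) for i in range(EFFECTIVE_BATCH)]
ys = [3.0 * x + (i % 3 - 1) * 0.5 for i, x in enumerate(xs)]
w = 0.5

# Gradient computed on all 64 samples at once.
full_grad = grad_mse(w, xs, ys)

# The same gradient accumulated over 8 minibatches of 8 samples.
accum_grad = 0.0
for step in range(ACCUM_STEPS):
    lo = step * MICRO_BATCH
    hi = lo + MICRO_BATCH
    # Divide by ACCUM_STEPS so the accumulated sum is the full-batch mean.
    accum_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / ACCUM_STEPS

print(EFFECTIVE_BATCH, abs(full_grad - accum_grad) < 1e-9)
```

Only one minibatch of 8 has to fit in memory at a time, which is the part that matters on an 8 GiB card.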

Inspired by OpenAI:

Attention Type Pattern: "full, row, col, row" - matching the OpenAI paper a bit here (although I believe they stop using dense attention after the first layer and use a conv_like for the final layer)
The DALL-E paper uses a RandomHorizontalFlip augmentation for the dVAE, which they later say is dropped when training the Transformer itself because they "want to preserve text". That isn't one of my goals with this checkpoint, so I am using a RandomHorizontalFlip(p=0.5) and a fairly slight ColorJitter(brightness=0.1, saturation=0.2, contrast=0.2). This of course only really matters for the second epoch.
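
As a rough sketch of how the four-entry attention pattern above could map onto 16 layers, assuming the pattern is simply repeated in order until every layer has a type. The cycling rule and the `layer_attention_types` helper are assumptions for illustration, not something confirmed in this post.

```python
from itertools import cycle, islice

ATTN_TYPES = ("full", "row", "col", "row")  # pattern from the post
DEPTH = 16                                  # transformer depth from the post

def layer_attention_types(attn_types, depth):
    """Repeat the pattern in order until each of `depth` layers has a type.

    Assumption: the pattern is cycled; this helper is hypothetical.
    """
    return list(islice(cycle(attn_types), depth))

layers = layer_attention_types(ATTN_TYPES, DEPTH)
for index, attn_type in enumerate(layers, start=1):
    print(f"layer {index:2d}: {attn_type}")
```

Under that assumption, a quarter of the layers use full attention and half use row attention.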

Inspired by CogView:

Automatic Mixed Precision "O1" (they use O2)
Learning Rate Warmup schedule from 0 to 3e-4 for the first 20k of ~160k total steps (not exactly uncommon for transformers)
Using hidden "stable" parameter implemented by lucidrains.
Image loss weighted the same as text loss (--loss_img_weight 1)
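
The warmup schedule in the list above can be sketched as a small function. The post only gives the endpoints (0 to 3e-4 over the first 20k steps), so the linear ramp and the constant rate after warmup are both assumptions, and `learning_rate` is a hypothetical helper rather than anything from DALLE-pytorch or DeepSpeed.

```python
PEAK_LR = 3e-4         # from the post
WARMUP_STEPS = 20_000  # from the post

def learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Warm up from 0 to peak_lr, then hold it.

    Assumptions: the ramp is linear and the rate stays constant afterwards.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps

# Sample a few points across the ~160k total steps mentioned in the post.
for step in (0, 10_000, 20_000, 160_000):
    print(step, learning_rate(step))
```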

https://www.dropbox.com/s/drpkcmr6b3zbftm/royalty_free.pt

If you'd like to generate from this checkpoint and have access to an NVIDIA GPU with CUDA support:

git clone https://github.com/lucidrains/DALLE-pytorch
cd DALLE-pytorch
python setup.py install

# Download the byte pair encoding
wget https://www.dropbox.com/s/ay01p8zegfwof8t/variety.bpe

# Download the checkpoint
wget https://www.dropbox.com/s/drpkcmr6b3zbftm/royalty_free.pt

# Create a directory for output images to be stored:
mkdir -p generations

# Run `generate.py` to generate 32 images with the `royalty_free.pt` checkpoint
python generate.py \
        --dalle_path 'royalty_free.pt' \
        --taming \
        --text "royalty free illustration of a snowy sunset" \
        --num_images 32 \
        --batch_size 16 \
        --outputs_dir './generations' \
        --bpe_path variety.bpe

JuliusJacobsohn · 2022-03-07T23:39:16Z

Thanks for providing this info.
May I ask, how long did the training process take in total?
I have around 20 million images lined up, but I only have access to a 3080 Ti. I wouldn't mind letting my PC run for a few weeks, but not for a few months.
