Train a custom CLIP with DeepSpeed CPU offload, 16 bit precision #388
(disclaimer): this is code for training a custom CLIP from the repository here, not the one in the OpenAI repo. For something like that, I recommend open_clip. There are valid concerns about the effectiveness of a CLIP trained with a low batch size, as the contrastive retrieval task has far less context to work with. Food for thought.
There's plenty left to do to make this as robust as the other training scripts, but if you have deepspeed working, this should work now with far fewer caveats than DALL-E. I trained a small CLIP last night on COCO using 16-bit precision, DeepSpeed stage 3, and CPU offload for both the parameters and the optimizer. I haven't done many rigorous comparisons, but I was able to actually use my computer while training thanks to CPU offload, which was refreshing.
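For reference, that combination (ZeRO stage 3, CPU offload for parameters and optimizer state, fp16) maps onto a DeepSpeed config roughly like the sketch below. The script builds its config from the CLI flags rather than from a JSON file, so treat the exact keys and values here as assumptions to check against your DeepSpeed version, not as the config train_clip.py actually generates.

```python
# Rough sketch of a DeepSpeed config matching the run above: ZeRO stage 3,
# CPU offload for parameters + optimizer state, fp16. Values are illustrative
# assumptions, not the config train_clip.py actually emits.
import json

ds_config = {
    "train_batch_size": 128,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
}

with open("deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```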
Weights & Biases workspace:
https://wandb.ai/dalle-pytorch-replicate/dalle_train_clip_report
I'll most likely be busy over the holidays, so I won't have time to implement everything else, but it's mostly a matter of copying the work done in previous contributions to train_dalle.py/train_vae.py. I suspect @janEbert was responsible for ensuring external parameters were flagged for deepspeed in @lucidrains' CLIP implementation?
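(For context: with ZeRO stage 3, any parameter accessed outside the forward pass of the module that owns it has to be registered explicitly. Below is a minimal sketch of what that flagging looks like, assuming deepspeed's register_external_parameter API; the module and attribute names are made up for illustration and are not the actual dalle-pytorch CLIP internals.)

```python
# Minimal sketch of flagging an "external" parameter for ZeRO stage 3.
# TextTransformer / temperature are illustrative names, not the real CLIP code.
import torch
import torch.nn as nn
import deepspeed

class TextTransformer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # Parameter owned by this module but also read by the parent module.
        self.temperature = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        return self.proj(x)

class CLIP(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.text_encoder = TextTransformer(dim)
        # Tell ZeRO stage 3 that this module uses a parameter it does not own,
        # so DeepSpeed gathers it when CLIP.forward runs.
        deepspeed.zero.register_external_parameter(self, self.text_encoder.temperature)

    def forward(self, x):
        return self.text_encoder(x) * self.text_encoder.temperature
```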
There are also likely to be errors, and there are probably a few things missing from the CLIP paper. I think they clamped their logits to ln(2) or similar - not sure if we're doing that.
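If we do add it, the clamping would presumably go on the learnable logit scale (temperature) before it multiplies the image/text similarities; a rough sketch is below. The bound used here is a placeholder assumption - check the paper / open_clip for the value actually used.

```python
# Sketch of clamping a learnable logit scale (temperature) before computing
# contrastive logits. The bound is a placeholder assumption - verify the actual
# value against the CLIP paper / open_clip before relying on it.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))  # CLIP-style init

def contrastive_logits(image_emb, text_emb, max_scale=math.log(100.0)):
    scale = logit_scale.clamp(max=max_scale).exp()
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return scale * image_emb @ text_emb.t()
```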
To run with deepspeed, bite the bullet and set up a docker container targeting pytorch=1.7.1, cuda=10.2. Conda works too - just make sure you set python=3.7, as there are issues with >3.7. There's no guarantee that fused operations will run on any particular GPU, even with a docker container; the only officially supported GPUs are the V100 and A100. If you see an error about failed JIT compilation, that may be the reason.
run_train_clip.sh
#!/bin/bash
deepspeed train_clip.py --dataset_dir=/mnt/evo_internal_1TB/DATASETS/COCO \
  --epochs=200 \
  --batch_size=128 \
  --learning_rate=0.004 \
  --clip_grad_norm=1.0 \
  --resize_ratio=0.8 \
  --truncate_captions=True \
  --save_every_n_steps=1000 \
  --log_frequency=10 \
  --clip_output_file_name=clip_latest.pt \
  --dim_text=128 \
  --dim_image=128 \
  --dim_latent=256 \
  --text_enc_depth=6 \
  --text_seq_len=128 \
  --text_heads=8 \
  --num_visual_tokens=256 \
  --visual_enc_depth=6 \
  --visual_heads=8 \
  --visual_image_size=128 \
  --visual_patch_size=16 \
  --channels=3 \
  --num_workers=24 \
  --fp16=True \
  --distributed_backend=deepspeed
After training has finished, you can create a 32-bit PyTorch checkpoint from the DeepSpeed checkpoint directory:
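DeepSpeed normally drops a zero_to_fp32.py helper into that directory, which consolidates the sharded ZeRO checkpoint from the command line; the sketch below does roughly the equivalent from Python. The paths are placeholders, and the exact helper API may differ across DeepSpeed versions.

```python
# Hedged sketch: consolidate a sharded ZeRO stage 3 checkpoint into a single
# 32-bit PyTorch state dict. The directory name is a placeholder - point it at
# wherever train_clip.py saved the DeepSpeed checkpoint.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "./clip_checkpoints"  # placeholder path
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
torch.save(state_dict, "clip_latest_fp32.pt")
```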