Image generation with deepspeed --fp16 #394

afiaka87 · 2021-11-30T20:01:18Z

No description provided.

afiaka87 · 2021-11-30T20:04:05Z

The VQGAN simply won't work in 16-bit precision unfortunately. Converting only the torch modules of dalle which aren't VQGAN, and then forcing autocasting to fp32 for the vqgan mitigates this issue and still gives a similar/the same speedup.

It also fixes the issue where you couldn't actually decode when training in fp16 mode and had to wait until after to upconvert your checkpoint to 32 bit.

edit:

Okay here's a little more due diligence https://wandb.ai/afiaka87/vqgan_precision_fixes

it's probably wise to test this on distributed as well.

another edit:
I enabled autocast on the forward pass of the dalle and that works too. speedup was almost the same; no deepspeed required.

rom1504 · 2021-11-30T20:05:55Z

Wow amazing! Is that really enough to make it work ?
I've been missing that feature a lot while using deepspeed

afiaka87 · 2021-11-30T20:06:11Z

Wow amazing! Is that really enough to make it work ? I've been missing that feature a lot while using deepspeed

please test it! but i think so yes.

afiaka87 · 2021-11-30T20:28:35Z

@janEbert end of an era, eh?

afiaka87 · 2021-12-01T17:46:54Z

@lucidrains @rom1504 Has some stability issues surrounding the top_k function I think. Without DeepSpeed to auto-skip NaN's in Pytorch native, training can break after awhile.

This was alleviated quite a bit by using both the --stable and sandwich_norm=True, so it's good to know those work as intended. I think I was able to finish about 3 epochs with those settings whereas I couldn't get finish a single epoch without.

It's probably good to disable by default using the --amp flag we already have.

@lucidrains Do you have plans for continued progress on the repository here? I mostly just wanted to push this up because it had been bothering me so much - but I'm curious if you still intend to create a NUWA repo? Perhaps a clean start?

lucidrains · 2021-12-01T19:31:12Z

@afiaka87 hey! thanks for reporting on the stable and sandwich norm. that lines up with my experiences

could you point to the line of code for top_k?

I think this repository is mostly feature complete, and has mostly fulfilled its objective given the successful release of ruDALL-E. what other features would you like to see?

I could also add the scaled cosine similarity attention from SwinV2 for additional stability (https://arxiv.org/abs/2111.09883) . That's the only remaining thing I can think of

rom1504 · 2021-12-01T20:21:12Z

@afiaka87
I just tried this and got

    @autocast(enabled=True, dtype=torch.float32, cache_enabled=True)
TypeError: __init__() got an unexpected keyword argument 'dtype'

maybe we need a specific version of something?
indeed, this requires torch 1.10

afiaka87 · 2021-12-01T20:25:54Z

@lucidrains

I'm working with some modifications to my code; line numbers are inaccurate. Looking at the loss graph now it's pretty obvious there's a trend toward increasing in loss before the error occurs:

 File "../dalle_pytorch/dalle_pytorch.py", line 515, in generate_images
    sample = torch.multinomial(probs, 1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Switching to precision=fp32 fixes the issue, for the record.

I think this repository is mostly feature complete, and has mostly fulfilled its objective given the successful release of ruDALL-E

I agree - it's been quite the ride accomplishing all this as a community effort!

what other features would you like to see?
I have no requests actually! Everything changes so fast but indeed I think the time is coming to find something new to work on.

I saw your x-clip repository and I'll be certain to pitch in there if I can.

rom1504 · 2021-12-01T20:28:54Z

ok I confirm this code is working with torch 1.10, however one drawback is it increases the vram usage (because it's loaded vqgan as float32 instead of float16)
I guess that's probably ok
the optimal thing to do would be to not load vqgan at all and precompute tokens

rom1504 · 2021-12-01T20:32:02Z

https://wandb.ai/rom1504/laion_subset/reports/DALLE-dino--VmlldzoxMjg5OTgz here's the experiment with it, which is now nicely displaying samples
first time I train on this dataset, you might see some dinosaur illustrations generated if it works :D (it'll look like this https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn.laion.ai&index=laion_400m_128G&query=an+illustration+of+a+t-rex+drawing+in+color )

afiaka87 · 2021-12-01T20:32:30Z

ok I confirm this code is working with torch 1.10, however one drawback is it increases the vram usage (because it's loaded vqgan as float32 instead of float16) I guess that's probably ok the optimal thing to do would be to not load vqgan at all and precompute tokens

Did you train long enough to see any NaN/Inf errors?

I intend to disable it by default by using the context manager inside just the training loop; so

with autocast(enabled=args.amp and not using_deepspeed):
  loss = dalle(..)
    
  # backprop
  # zero gradients
  # ...

rom1504 · 2021-12-01T20:33:34Z

I started the training 5min ago, so no I don't know

what do you intend to disable?

afiaka87 · 2021-12-01T20:34:35Z

Sorry this only effects non-distributed pytorch. Are you using 16-bit precision with DeepSpeed?

My current impl of mixed precision for pytorch was enabled by default. Due to stability issues I've decided to make it optional.

edit: @rom1504 for more context - I think DeepSpeed's automatic NaN skipping subverts the issue.

lucidrains · 2021-12-01T20:35:53Z

@lucidrains

I'm working with some modifications to my code; line numbers are inaccurate. Looking at the loss graph now it's pretty obvious there's a trend toward increasing in loss before the error occurs:
 File "../dalle_pytorch/dalle_pytorch.py", line 515, in generate_images
    sample = torch.multinomial(probs, 1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 
Switching to precision=fp32 fixes the issue, for the record.

I think this repository is mostly feature complete, and has mostly fulfilled its objective given the successful release of ruDALL-E

I agree - it's been quite the ride accomplishing all this as a community effort!

what other features would you like to see?
I have no requests actually! Everything changes so fast but indeed I think the time is coming to find something new to work on.

I saw your x-clip repository and I'll be certain to pitch in there if I can.

oh shoot, i wasn't aware of this issue

do you want to see if 1.1.5 fixes this? https://github.com/lucidrains/DALLE-pytorch/releases/tag/1.1.5

rom1504 · 2021-12-01T20:37:26Z

Sorry this only effects non-distributed pytorch. Are you using 16-bit precision with DeepSpeed?

yes

afiaka87 · 2021-12-01T21:45:43Z

do you want to see if 1.1.5 fixes this? https://github.com/lucidrains/DALLE-pytorch/releases/tag/1.1.5

@lucidrains Yes that stabilized the training thanks

afiaka87 · 2021-12-01T22:12:35Z

Hm - it looks like the dtype specifier isn't available on pytorch LTS. Must be new. I don't know of another way to solve the issue (for deepspeed), however. It would be nice to provide a preprocessor that pre-encodes vqgan encodings to numpy files. @rom1504 didn't the training@home team put a bunch of LAION encoded via the gumbel vqgan on huggingfaces?

rom1504 · 2021-12-01T22:42:34Z

Yes several people did that, but nobody packaged a vqgan inference script properly, it would be useful to do

rom1504 · 2021-12-02T00:47:12Z

I think it's ok to depend on torch 1.10 ; just need to say it in the readme

rom1504 · 2021-12-02T01:41:31Z

I am not sure why exactly but this increases the vram use a lot in multi gpu mode
so I wouldn't recommend merging in this state, this needs more tuning

afiaka87 · 2021-12-05T05:19:29Z

Yes, I would be somewhat more comfortable with a hard requirement on Pytorch 1.10 if it didn't also mean a harsh decision of only CUDA 11.3 (unavailable for my operating system presently) or all the way back to CUDA 10.2. This is relevant for DeepSpeed support as many of their fused operations have very strange support for CUDA versioning that I haven't quite worked out (and seems to change with each change to their main branch). For instance, by choosing this scheme I can comfortably use 16-bit with stage 3 - but attempting to use the DeepSpeed "FusedAdam" on my CPU results in a failed compilation if I don't use Pytorch 1.8.2 with CUDA 11.1; but then of course I can't use the casting features and am stuck again with 32-bit inference.

tl;dr - Forcing a Pytorch version is forcing a CUDA version which isn't always an option for a variety of setups.

@rom1504 - With regard to your comment about VRAM usage increasing; that's not good! I guess it probably has to do with DeepSpeed anticipating 16-bit precision for external parameters - perhaps to the point of using algorithms which have tradeoffs for 32-bit. That sounds very challenging to actually debug though.

At any rate, I think leaving this as an open PR is maybe a good way to inform people that it's possible and that there are some known issues. We could also close this PR and use a pinned issue? @lucidrains whichever you feel works best

afiaka87 · 2021-12-05T05:31:11Z

Yes several people did that, but nobody packaged a vqgan inference script properly, it would be useful to do

I'm considering a package similar to your clip-retrieval repo centered around "encode your raw dataset first, use e.g. encoder.to_vqgan to create, then use datasets.from_vqgan to load during training. The idea is to support VQGAN derivatives and the dalle dVAE at first - but there's a lot of training directly on encodings lately in vision/audio/text so I'd like to support those use cases as well eventually.

janEbert · 2021-12-15T10:22:00Z

@janEbert end of an era, eh?

Big congratulations on finally fixing this! Although if I understand correctly, doesn't this simply "disable" FP16 mode for the erroneous functions? :D

Sorry I never found the time to work on the DeepSpeed stuff further. From what I learned about DeepSpeed, we would've had to rewrite the model classes a bit in order to make everything work (i.e. have the forward call take a flag to generate instead).
That was not really good for the codebase, though, so there was quite a lot of hacking around that limitation which never really resulted in much, sadly.

Clay Mullis added 2 commits November 30, 2021 13:58

(fp16/deepspeed w generations) Cast submodules to half except vqgan.

54c15c6

Autocast VQGAN calls to fp32 in fp16 training

31fa1af

Clay Mullis and others added 2 commits November 30, 2021 15:14

enable pytorch's built-in automatic mixed precision by default

5d843fa

remove duplicate code

43ea4a3

Shico69 approved these changes Apr 29, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Image generation with deepspeed --fp16 #394

Image generation with deepspeed --fp16 #394

afiaka87 commented Nov 30, 2021

afiaka87 commented Nov 30, 2021 •

edited

Loading

rom1504 commented Nov 30, 2021

afiaka87 commented Nov 30, 2021

afiaka87 commented Nov 30, 2021

afiaka87 commented Dec 1, 2021

lucidrains commented Dec 1, 2021 •

edited

Loading

rom1504 commented Dec 1, 2021 •

edited

Loading

afiaka87 commented Dec 1, 2021 •

edited

Loading

rom1504 commented Dec 1, 2021

rom1504 commented Dec 1, 2021

afiaka87 commented Dec 1, 2021 •

edited

Loading

rom1504 commented Dec 1, 2021

afiaka87 commented Dec 1, 2021 •

edited

Loading

lucidrains commented Dec 1, 2021

rom1504 commented Dec 1, 2021

afiaka87 commented Dec 1, 2021 •

edited

Loading

afiaka87 commented Dec 1, 2021

rom1504 commented Dec 1, 2021

rom1504 commented Dec 2, 2021

rom1504 commented Dec 2, 2021 •

edited

Loading

afiaka87 commented Dec 5, 2021

afiaka87 commented Dec 5, 2021

janEbert commented Dec 15, 2021

Image generation with deepspeed --fp16 #394

Are you sure you want to change the base?

Image generation with deepspeed --fp16 #394

Conversation

afiaka87 commented Nov 30, 2021

afiaka87 commented Nov 30, 2021 • edited Loading

rom1504 commented Nov 30, 2021

afiaka87 commented Nov 30, 2021

afiaka87 commented Nov 30, 2021

afiaka87 commented Dec 1, 2021

lucidrains commented Dec 1, 2021 • edited Loading

rom1504 commented Dec 1, 2021 • edited Loading

afiaka87 commented Dec 1, 2021 • edited Loading

rom1504 commented Dec 1, 2021

rom1504 commented Dec 1, 2021

afiaka87 commented Dec 1, 2021 • edited Loading

rom1504 commented Dec 1, 2021

afiaka87 commented Dec 1, 2021 • edited Loading

lucidrains commented Dec 1, 2021

rom1504 commented Dec 1, 2021

afiaka87 commented Dec 1, 2021 • edited Loading

afiaka87 commented Dec 1, 2021

rom1504 commented Dec 1, 2021

rom1504 commented Dec 2, 2021

rom1504 commented Dec 2, 2021 • edited Loading

afiaka87 commented Dec 5, 2021

afiaka87 commented Dec 5, 2021

janEbert commented Dec 15, 2021

afiaka87 commented Nov 30, 2021 •

edited

Loading

lucidrains commented Dec 1, 2021 •

edited

Loading

rom1504 commented Dec 1, 2021 •

edited

Loading

afiaka87 commented Dec 1, 2021 •

edited

Loading

afiaka87 commented Dec 1, 2021 •

edited

Loading

afiaka87 commented Dec 1, 2021 •

edited

Loading

afiaka87 commented Dec 1, 2021 •

edited

Loading

rom1504 commented Dec 2, 2021 •

edited

Loading