
Image generation with deepspeed --fp16 #394

Open
wants to merge 4 commits into main

Conversation

afiaka87
Contributor

No description provided.

@afiaka87
Contributor Author

afiaka87 commented Nov 30, 2021

Unfortunately, the VQGAN simply won't work in 16-bit precision. Converting only the torch modules of DALL-E that aren't the VQGAN, and then forcing autocast to fp32 for the VQGAN, mitigates this issue and still gives a similar (or the same) speedup.

It also fixes the issue where you couldn't actually decode while training in fp16 mode and had to wait until after training to upconvert your checkpoint to 32-bit.

edit:

Okay, here's a little more due diligence: https://wandb.ai/afiaka87/vqgan_precision_fixes

It's probably wise to test this on distributed as well.

another edit:
I enabled autocast on the forward pass of the dalle and that works too. The speedup was almost the same; no DeepSpeed required.
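
Roughly, the idea is something like this (an illustrative sketch, not the exact diff - it mirrors the autocast decorator this PR uses, assumes a VQGAN with a decode method, and needs torch >= 1.10, where torch.cuda.amp.autocast accepts a dtype argument):

    import torch
    from torch.cuda.amp import autocast

    def decode_images(vqgan, image_tokens):
        # The VQGAN is numerically unstable in fp16, so even when the rest
        # of the model runs in an fp16 autocast region, force the decode
        # back into an fp32 region.
        with autocast(enabled=True, dtype=torch.float32, cache_enabled=True):
            return vqgan.decode(image_tokens.float())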

@rom1504
Contributor

rom1504 commented Nov 30, 2021

Wow, amazing! Is that really enough to make it work?
I've been missing that feature a lot while using DeepSpeed.

@afiaka87
Contributor Author

Wow, amazing! Is that really enough to make it work? I've been missing that feature a lot while using DeepSpeed.

Please test it! But I think so, yes.

@afiaka87
Contributor Author

@janEbert end of an era, eh?

@afiaka87
Contributor Author

afiaka87 commented Dec 1, 2021

@lucidrains @rom1504 There are some stability issues surrounding the top_k function, I think. Without DeepSpeed to auto-skip NaNs, native PyTorch training can break after a while.

This was alleviated quite a bit by using both the --stable flag and sandwich_norm=True, so it's good to know those work as intended. I think I was able to finish about 3 epochs with those settings, whereas I couldn't finish a single epoch without them.

It's probably best to disable this by default and gate it behind the --amp flag we already have.

@lucidrains Do you have plans for continued progress on the repository here? I mostly just wanted to push this up because it had been bothering me so much - but I'm curious if you still intend to create a NUWA repo? Perhaps a clean start?

@lucidrains
Owner

lucidrains commented Dec 1, 2021

@afiaka87 hey! thanks for reporting on the stable and sandwich norm. that lines up with my experiences

could you point to the line of code for top_k?

I think this repository is mostly feature complete, and has mostly fulfilled its objective given the successful release of ruDALL-E. what other features would you like to see?

I could also add the scaled cosine similarity attention from SwinV2 for additional stability (https://arxiv.org/abs/2111.09883). that's the only remaining thing I can think of
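
for reference, a minimal sketch of that variant (my paraphrase of the paper, not code from this repo): the attention logits are the cosine similarity of q and k times a learnable, clamped temperature, instead of a dot product scaled by 1/sqrt(d), which bounds the logits and helps low-precision stability

    import math
    import torch
    import torch.nn.functional as F

    def cosine_attention(q, k, v, logit_scale):
        # q, k, v: (batch, heads, seq, dim)
        # logit_scale: learnable per-head scalar, e.g.
        #   nn.Parameter(torch.log(10 * torch.ones(heads, 1, 1)))
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        # clamp the learnable temperature, as in the SwinV2 reference code
        scale = logit_scale.clamp(max=math.log(100.0)).exp()
        attn = (q @ k.transpose(-2, -1)) * scale
        return attn.softmax(dim=-1) @ v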

@rom1504
Contributor

rom1504 commented Dec 1, 2021

@afiaka87
I just tried this and got

    @autocast(enabled=True, dtype=torch.float32, cache_enabled=True)
TypeError: __init__() got an unexpected keyword argument 'dtype'

Maybe we need a specific version of something?

edit: indeed, this requires torch 1.10
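
maybe worth guarding at startup - something like this (just a sketch, not in the PR):

    import torch

    # the dtype argument to autocast only exists from torch 1.10 onward
    if tuple(int(v) for v in torch.__version__.split(".")[:2]) < (1, 10):
        raise RuntimeError("fp16 generation needs torch >= 1.10")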

@afiaka87
Contributor Author

afiaka87 commented Dec 1, 2021

@lucidrains

I'm working with some modifications to my code, so the line numbers are inaccurate. Looking at the loss graph, it's pretty obvious there's a trend of increasing loss before the error occurs:

 File "../dalle_pytorch/dalle_pytorch.py", line 515, in generate_images
    sample = torch.multinomial(probs, 1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 

[loss graph]
Switching to precision=fp32 fixes the issue, for the record.
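
A band-aid that might have helped here would be sanitizing the probability tensor before sampling - just a sketch, not something I've tested:

    import torch

    def safe_multinomial(probs, num_samples=1):
        # zero out any inf/nan entries and clamp negatives before sampling
        probs = torch.nan_to_num(probs, nan=0.0, posinf=0.0, neginf=0.0).clamp(min=0.0)
        # fall back to uniform sampling for any row that was wiped out
        zero_rows = probs.sum(dim=-1, keepdim=True).eq(0)
        probs = torch.where(zero_rows, torch.ones_like(probs), probs)
        return torch.multinomial(probs, num_samples)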

I think this repository is mostly feature complete, and has mostly fulfilled its objective given the successful release of ruDALL-E

I agree - it's been quite the ride accomplishing all this as a community effort!

what other features would you like to see?

I have no requests, actually! Everything changes so fast, but indeed I think the time is coming to find something new to work on.

I saw your x-clip repository and I'll be certain to pitch in there if I can.

@rom1504
Contributor

rom1504 commented Dec 1, 2021

OK, I confirm this code is working with torch 1.10. However, one drawback is that it increases the VRAM usage (because it loads the VQGAN as float32 instead of float16).
I guess that's probably OK.
The optimal thing to do would be to not load the VQGAN at all and precompute the tokens.

@rom1504
Contributor

rom1504 commented Dec 1, 2021

https://wandb.ai/rom1504/laion_subset/reports/DALLE-dino--VmlldzoxMjg5OTgz here's the experiment with it, which is now nicely displaying samples.
It's the first time I've trained on this dataset; you might see some dinosaur illustrations generated if it works :D (it'll look like this: https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn.laion.ai&index=laion_400m_128G&query=an+illustration+of+a+t-rex+drawing+in+color)

@afiaka87
Contributor Author

afiaka87 commented Dec 1, 2021

OK, I confirm this code is working with torch 1.10. However, one drawback is that it increases the VRAM usage (because it loads the VQGAN as float32 instead of float16). I guess that's probably OK. The optimal thing to do would be to not load the VQGAN at all and precompute the tokens.

Did you train long enough to see any NaN/Inf errors?

I intend to disable it by default by using the context manager inside just the training loop, so:

with autocast(enabled=args.amp and not using_deepspeed):
    loss = dalle(...)

# backprop
# zero gradients
# ...
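
(If that autocast path runs in fp16, it would normally also be paired with a GradScaler so small gradients don't underflow - a sketch, with opt standing in for whatever optimizer the training script builds:)

    from torch.cuda.amp import GradScaler, autocast

    scaler = GradScaler(enabled=args.amp and not using_deepspeed)

    with autocast(enabled=args.amp and not using_deepspeed):
        loss = dalle(...)

    # unscale/step/update replace the usual backward + optimizer step
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
    opt.zero_grad()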

@rom1504
Contributor

rom1504 commented Dec 1, 2021

I started the training 5 minutes ago, so no, I don't know yet.

What do you intend to disable?

@afiaka87
Contributor Author

afiaka87 commented Dec 1, 2021

Sorry, this only affects non-distributed PyTorch. Are you using 16-bit precision with DeepSpeed?

My current implementation of mixed precision for PyTorch was enabled by default. Due to stability issues, I've decided to make it optional.

edit: @rom1504 for more context - I think DeepSpeed's automatic NaN skipping sidesteps the issue.

@lucidrains
Owner

    sample = torch.multinomial(probs, 1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

oh shoot, i wasn't aware of this issue

do you want to see if 1.1.5 fixes this? https://github.com/lucidrains/DALLE-pytorch/releases/tag/1.1.5

@rom1504
Contributor

rom1504 commented Dec 1, 2021

Sorry, this only affects non-distributed PyTorch. Are you using 16-bit precision with DeepSpeed?

yes

@afiaka87
Contributor Author

afiaka87 commented Dec 1, 2021

do you want to see if 1.1.5 fixes this? https://github.com/lucidrains/DALLE-pytorch/releases/tag/1.1.5

@lucidrains Yes, that stabilized the training - thanks!

@afiaka87
Contributor Author

afiaka87 commented Dec 1, 2021

Hm - it looks like the dtype specifier isn't available on PyTorch LTS; it must be new. I don't know of another way to solve the issue (for DeepSpeed), however. It would be nice to provide a preprocessor that precomputes VQGAN encodings and saves them to numpy files. @rom1504 didn't the training@home team put a bunch of LAION images encoded via the Gumbel VQGAN on Hugging Face?
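
Something like the following could work as a starting point (an illustrative sketch; it assumes the VQGanVAE wrapper's get_codebook_indices method from dalle_pytorch):

    import numpy as np
    import torch

    @torch.no_grad()
    def precompute_tokens(vqgan, dataloader, out_path):
        all_codes = []
        for images in dataloader:
            # (batch, seq_len) integer codebook indices instead of raw pixels
            codes = vqgan.get_codebook_indices(images.cuda())
            all_codes.append(codes.cpu().numpy())
        np.save(out_path, np.concatenate(all_codes))

Training could then index into the saved array without ever loading the VQGAN.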

@rom1504
Contributor

rom1504 commented Dec 1, 2021

Yes, several people did that, but nobody has packaged a VQGAN inference script properly; it would be useful to do.

@rom1504
Contributor

rom1504 commented Dec 2, 2021

I think it's OK to depend on torch 1.10; we just need to say so in the readme.

@rom1504
Contributor

rom1504 commented Dec 2, 2021

I am not sure exactly why, but this increases the VRAM use a lot in multi-GPU mode,
so I wouldn't recommend merging in this state; it needs more tuning.

@afiaka87
Contributor Author

afiaka87 commented Dec 5, 2021

Yes, I would be somewhat more comfortable with a hard requirement on PyTorch 1.10 if it didn't also mean a harsh choice between only CUDA 11.3 (unavailable for my operating system presently) or going all the way back to CUDA 10.2. This is relevant for DeepSpeed support, as many of their fused operations have very strange CUDA version support that I haven't quite worked out (and it seems to change with each change to their main branch). For instance, by choosing this scheme I can comfortably use 16-bit with stage 3 - but attempting to use DeepSpeed's "FusedAdam" on my machine results in a failed compilation unless I use PyTorch 1.8.2 with CUDA 11.1; but then, of course, I can't use the casting features and am stuck again with 32-bit inference.

tl;dr - Forcing a PyTorch version forces a CUDA version, which isn't always an option for a variety of setups.

@rom1504 - With regard to your comment about VRAM usage increasing: that's not good! I guess it probably has to do with DeepSpeed anticipating 16-bit precision for external parameters - perhaps to the point of using algorithms which have tradeoffs for 32-bit. That sounds very challenging to actually debug, though.

At any rate, I think leaving this as an open PR is maybe a good way to inform people that it's possible and that there are some known issues. We could also close this PR and use a pinned issue? @lucidrains - whichever you feel works best.

@afiaka87
Contributor Author

afiaka87 commented Dec 5, 2021

Yes several people did that, but nobody packaged a vqgan inference script properly, it would be useful to do

I'm considering a package similar to your clip-retrieval repo, centered around "encode your raw dataset first": use e.g. encoder.to_vqgan to create the encodings, then datasets.from_vqgan to load them during training. The idea is to support VQGAN derivatives and the DALL-E dVAE at first - but there's a lot of training directly on encodings lately in vision/audio/text, so I'd like to eventually support those use cases as well.

@janEbert
Contributor

@janEbert end of an era, eh?

Big congratulations on finally fixing this! Although if I understand correctly, doesn't this simply "disable" FP16 mode for the erroneous functions? :D

Sorry, I never found the time to work on the DeepSpeed stuff further. From what I learned about DeepSpeed, we would've had to rewrite the model classes a bit to make everything work (i.e., have the forward call take a flag to generate instead).
That wouldn't have been great for the codebase, though, so there was quite a lot of hacking around that limitation, which never really amounted to much, sadly.
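
Roughly the kind of rewrite I mean (a hypothetical sketch - DeepSpeed's engine only wraps forward(), so generation would have to be routed through it with a flag rather than via a separate generate_images() method; compute_loss here is illustrative):

    import torch.nn as nn

    class DALLE(nn.Module):
        def forward(self, text, image=None, generate=False, **kwargs):
            if generate:
                # route generation through forward() so DeepSpeed wraps it
                return self.generate_images(text, **kwargs)
            return self.compute_loss(text, image)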
