AdamW optimizer gives better loss convergence but poor image generation #139
Replies: 5 comments 4 replies
-
Awesome! Would using MADGRAD yield faster convergence?
-
Cause of poor generation
The cause of the poor generation is the weight decay (see Line 146 in 4aade13). When I simply discarded all the weight decay, the generation became normal. Most importantly, AdamW still seems better at loss convergence than Adam; see the experiment below.
Experiment
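For context, here is a minimal sketch (not the repository's actual code; the helper name and hyperparameters are placeholders) of what "discarding all the weight decay" could look like with torch.optim.AdamW, optionally keeping decay only on 2D weight matrices and never on embeddings, norms, or biases:

```python
import torch
from torch import nn

def build_adamw(model: nn.Module, lr: float = 3e-4, weight_decay: float = 0.0) -> torch.optim.AdamW:
    """Hypothetical helper: apply weight decay only to >=2D weight matrices,
    never to embeddings, normalization weights, or biases."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim < 2 or "emb" in name or "norm" in name:
            no_decay.append(param)
        else:
            decay.append(param)
    param_groups = [
        {"params": decay, "weight_decay": weight_decay},  # set to 0.0 to drop decay entirely
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(param_groups, lr=lr)
```

With weight_decay=0.0, AdamW reduces to plain Adam-style updates without the decoupled decay term, which is consistent with the observation that dropping decay restores generation quality.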
-
@Ki6an
-
Hello! Great results! Could you say how many epochs the model was trained for?
-
Hi, are there any updates?
-
In the DALL-E paper, they used the AdamW optimizer, LR warmup, and LR decay (related PR: #138).
I found LR decay actually helpful in #131, but the AdamW optimizer gives very weird results.
As shown in the results below, AdamW gives a much faster decrease in the image cross-entropy (CE) loss, while the text CE loss is the same as with Adam.
However, the visualization from the AdamW run shows corrupted images, which is strange.
The low image CE loss means the model is predicting the correct image tokens, yet the generated images are poor.
I haven't figured it out yet, so I wanted to discuss it with you folks 😃.
What do you think about this?
AdamW experiment
[Training graph]
[Visualization]
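For anyone reproducing this setup, here is a minimal sketch of combining AdamW with linear LR warmup and cosine LR decay, assuming plain PyTorch and a LambdaLR schedule; the model and hyperparameter values are placeholders, not the repo's actual settings:

```python
import math
import torch

# Placeholder model and hyperparameters, only to make the sketch self-contained.
model = torch.nn.Linear(16, 16)
base_lr, warmup_steps, total_steps = 3e-4, 5_000, 500_000

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=1e-2)

def warmup_then_cosine(step: int) -> float:
    # Linear warmup from 0 to base_lr, then cosine decay over the remaining steps.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_cosine)

# Training loop order: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Given the finding above about weight decay, keeping the warmup/decay schedule but setting weight_decay=0.0 (or excluding embeddings and norms from decay) may be the combination worth trying.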