AdamW optimizer gives better loss convergence but poor image generation #139
Replies: 5 comments 4 replies
-
Awesome! Would using MADGRAD yield faster convergence?
-
Cause of poor generation
The cause of the poor generation is the weight decay (see Line 146 in 4aade13). When I simply discarded all the weight decay, the generation became normal. Most importantly, AdamW still seems better at loss convergence than Adam; see the experiment below.
Experiment
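For context, here is a minimal sketch (not the repository's actual code; the helper name and hyperparameters are placeholders) of what "discarding all the weight decay" could look like with torch.optim.AdamW, optionally keeping decay only on 2D weight matrices and never on embeddings, norms, or biases:

```python
import torch
from torch import nn

def build_adamw(model: nn.Module, lr: float = 3e-4, weight_decay: float = 0.0) -> torch.optim.AdamW:
    """Hypothetical helper: apply weight decay only to >=2D weight matrices,
    never to embeddings, normalization weights, or biases."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim < 2 or "emb" in name or "norm" in name:
            no_decay.append(param)
        else:
            decay.append(param)
    param_groups = [
        {"params": decay, "weight_decay": weight_decay},  # set to 0.0 to drop decay entirely
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(param_groups, lr=lr)
```

With weight_decay=0.0, AdamW reduces to plain Adam-style updates without the decoupled decay term, which is consistent with the observation that dropping decay restores generation quality.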
-
@Ki6an
-
Hello! Great results! Could you say how many epochs the model was trained for?
-
Hi, are there any updates?
-
In the DALL-E paper, they used the AdamW optimizer, LR warmup, and LR decay (related PR: #138).
I found LR decay actually helpful in #131, but the AdamW optimizer gives very weird results.
As shown in the results below, AdamW gives a much faster decrease in the image cross-entropy (CE) loss, while the text CE loss is the same as with Adam.
However, the visualization from the AdamW run shows corrupted images, which is strange.
The low image CE loss means the model is predicting the correct image tokens, yet the generated images are poor.
I haven't figured it out yet, so I wanted to discuss it with you folks 😃.
What do you think about this?
AdamW experiment
[Training graph]
[Visualization]
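For anyone reproducing this setup, here is a minimal sketch of combining AdamW with linear LR warmup and cosine LR decay, assuming plain PyTorch and a LambdaLR schedule; the model and hyperparameter values are placeholders, not the repo's actual settings:

```python
import math
import torch

# Placeholder model and hyperparameters, only to make the sketch self-contained.
model = torch.nn.Linear(16, 16)
base_lr, warmup_steps, total_steps = 3e-4, 5_000, 500_000

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=1e-2)

def warmup_then_cosine(step: int) -> float:
    # Linear warmup from 0 to base_lr, then cosine decay over the remaining steps.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_cosine)

# Training loop order: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Given the finding above about weight decay, keeping the warmup/decay schedule but setting weight_decay=0.0 (or excluding embeddings and norms from decay) may be the combination worth trying.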