Using high LR + Learning rate scheduling with captions for better preservation #69
-
Whoa, pretty amazing result you've got there! Thank you for sharing!!
-
Great experiment! Seems to support the idea of different modules requiring different learning rates (related, perhaps, to the idea of using different learning rates for different layers). It's a little hard to disentangle the learning rate you set globally from the effect of Adam, which adapts learning rates on a per-parameter basis. I wonder what this would look like if you swapped in vanilla SGD? @ExponentialML, could you provide links to the training images? I'd like to give this a spin and try to reproduce it. Also, I'm not familiar with that VAE, can I get a link?
-
Edit (1-14-2023): You may still refer to this as guidance, but this pull request yields the best results to date.
Hey all! So I took some time to explore LoRA and schedulers to see if I can find any ways to preserve more details of the training set without losing much editability. Instead of using instance tokens, I wanted to see if you could "fine tune" the model in a more traditional way very quickly with a small dataset.
I became kind of interested in this discussion on hypernetworks, and how different activations or schedulers can lead to different outcomes depending on what you're trying to achieve. I never gave it much thought before, because I haven't really strayed far from the official Dreambooth methods for this type of training (constant LR, instance token, etc.).
Since LoRA benefits from a high learning rate, I thought: why not set it even higher? Ironically, in some cases it can be a benefit to allow some amount of overfitting, because you're given the leeway afterwards to adjust how strongly the weights are applied to the model.
To get the results in the above image, here are the settings I used (a rough sketch of the equivalent optimizer setup follows the list). For training, I used AUTOMATIC1111's WebUI with LoRA training enabled.
Unet LR: 1e-3
Text encoder LR: 5e-5
Scheduler: Polynomial (Default AdamW settings, weight decay = 0.01)
Epochs: 300 per image (1500 steps total in this particular case)
Warmup: 0
Unet Weight: 0.8
Text Encoder Weight: 0.9
Prior Preservation: None
Horizontal Flip: None
Instance Prompt: [filewords]
Class Prompt: [filewords]
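For anyone who wants to reproduce this outside the WebUI, here is a minimal sketch of what the per-module learning rates and optimizer settings above would look like in plain PyTorch. The module and variable names are stand-ins I made up for illustration, not the WebUI's internals.

```python
import torch
from torch import nn

# Stand-ins for the LoRA parameters injected into the Unet and the text
# encoder; in a real run these would come from the actual LoRA layers.
unet_lora_params = nn.Linear(768, 4, bias=False).parameters()
text_encoder_lora_params = nn.Linear(768, 4, bias=False).parameters()

# Two parameter groups so the Unet and text encoder get different LRs,
# with the AdamW weight decay of 0.01 from the settings above.
optimizer = torch.optim.AdamW(
    [
        {"params": unet_lora_params, "lr": 1e-3},          # Unet LR
        {"params": text_encoder_lora_params, "lr": 5e-5},  # Text encoder LR
    ],
    weight_decay=0.01,
)
```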
At a very high level, this scheduler starts at the initial learning rate (1e-3 here, with a power of 1.0 by default) and lowers it as each epoch is completed. In the case of LoRA, this allows you to capture an extremely large amount of detail.
LoRA has been shown to capture pretty good detail at 1e-4, but it suffers when that rate is held constant. With the settings above, the learning rate starts at 1e-3 and ends at 1e-4 over the course of training. This gives you the best of both worlds when it comes to LoRA training.
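To make the decay concrete, here is a small self-contained sketch of a polynomial schedule with power 1.0 and no warmup, decaying from 1e-3 to 1e-4 over the 1500 steps used here. It mirrors common polynomial-decay-with-end-LR implementations (e.g. the one in Hugging Face transformers) rather than the WebUI's exact code.

```python
def polynomial_lr(step, total_steps=1500, lr_init=1e-3, lr_end=1e-4, power=1.0):
    """Polynomial decay from lr_init to lr_end; power=1.0 is a straight line."""
    pct_remaining = 1.0 - step / total_steps
    return (lr_init - lr_end) * pct_remaining**power + lr_end

for step in (0, 375, 750, 1125, 1500):
    print(f"step {step:4d}: lr = {polynomial_lr(step):.2e}")
# step    0: lr = 1.00e-03
# step  375: lr = 7.75e-04
# step  750: lr = 5.50e-04
# step 1125: lr = 3.25e-04
# step 1500: lr = 1.00e-04
```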
For the training data, I simply used 5 images of Kiriko from the internet, plus captions, to train on. No instance tokens were used. This was sparked by this idea on Reddit.
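For context, a caption-based dataset like this is just each image paired with a text file whose contents are used as the prompt via [filewords]. The filenames below are hypothetical and only meant to show the layout:

```
dataset/
├── kiriko_01.png
├── kiriko_01.txt   <- plain-text caption describing the image
├── kiriko_02.png
├── kiriko_02.txt
└── ...
```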
The thing that intrigues me most about this method is that it took me no time at all to create this dataset (10 minutes max). Being able to quickly prototype datasets in this manner, alongside the weight variables (Unet / Text Encoder), makes this an amazing addition to many workflows.
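Regarding the Unet / Text Encoder weight variables: conceptually they just scale how strongly each module's LoRA delta is applied on top of the base weights. Below is a generic sketch of that idea, not the WebUI's actual implementation, and it ignores the alpha/rank scaling some implementations add.

```python
import torch

def apply_lora_delta(base_weight, lora_down, lora_up, scale):
    """Merge a low-rank LoRA delta into a base weight, scaled by `scale`
    (e.g. 0.8 for the Unet and 0.9 for the text encoder in this experiment)."""
    return base_weight + scale * (lora_up @ lora_down)

# Toy example: a 768x768 base weight with a rank-4 LoRA delta.
base = torch.randn(768, 768)
lora_down = torch.randn(4, 768) * 0.01   # (rank, in_features)
lora_up = torch.randn(768, 4) * 0.01     # (out_features, rank)

unet_style_merge = apply_lora_delta(base, lora_down, lora_up, scale=0.8)
text_encoder_style_merge = apply_lora_delta(base, lora_down, lora_up, scale=0.9)
```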
I will most definitely experiment with this avenue more, and it would probably benefit from a larger training dataset or even from using regularization images. Would love to hear anybody's thoughts or additions on the matter!