Using high LR + Learning rate scheduling with captions for better preservation #69
-
Whoa, pretty amazing result you've got there! Thank you for sharing!!
-
Great experiment! Seems to support the idea of different modules requiring different learning rates (related, perhaps, to the idea of using different learning rates for different layers). It's a little hard to disentangle the learning rate you set globally from the effect of Adam, which adapts learning rates on a per-parameter basis. I wonder what this would look like if you swapped in vanilla SGD? @ExponentialML, could you provide links to the training images? I'd like to give this a spin and try to reproduce it. Also, I'm not familiar with that VAE, can I get a link?
-
Edit (1-14-2023): You may still refer to this as guidance, but this pull request yields the best results to date.
Hey all! So I took some time to explore LoRA and schedulers to see if I can find any ways to preserve more details of the training set without losing much editability. Instead of using instance tokens, I wanted to see if you could "fine tune" the model in a more traditional way very quickly with a small dataset.
I became kind of interested in this discussion on hypernetworks, and how different activations or schedulers can lead to different outcomes depending on what you're trying to achieve. I never gave it much thought before, because I haven't really strayed far from the official Dreambooth methods for this type of training (constant LR, instance token, etc.).
Since LoRA benefits from a high learning rate, I thought: why not set it even higher? Ironically, in some cases it can be a benefit to allow some amount of overfitting, because you're given the leeway afterwards to adjust how strongly the weights are applied to the model.
To get the results in the above image, here are the settings I used (a rough sketch of the equivalent optimizer setup follows the list). For training, I used AUTOMATIC1111's WebUI with LoRA training enabled.
Unet LR: 1e-3
Text encoder LR: 5e-5
Scheduler: Polynomial (Default AdamW settings, weight decay = 0.01)
Epochs: 300 per image (1500 steps total in this particular case)
Warmup: 0
Unet Weight: 0.8
Text Encoder Weight: 0.9
Prior Preservation: None
Horizontal Flip: None
Instance Prompt: [filewords]
Class Prompt: [filewords]
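For anyone who wants to reproduce this outside the WebUI, here is a minimal sketch of what the per-module learning rates and optimizer settings above would look like in plain PyTorch. The module and variable names are stand-ins I made up for illustration, not the WebUI's internals.

```python
import torch
from torch import nn

# Stand-ins for the LoRA parameters injected into the Unet and the text
# encoder; in a real run these would come from the actual LoRA layers.
unet_lora_params = nn.Linear(768, 4, bias=False).parameters()
text_encoder_lora_params = nn.Linear(768, 4, bias=False).parameters()

# Two parameter groups so the Unet and text encoder get different LRs,
# with the AdamW weight decay of 0.01 from the settings above.
optimizer = torch.optim.AdamW(
    [
        {"params": unet_lora_params, "lr": 1e-3},          # Unet LR
        {"params": text_encoder_lora_params, "lr": 5e-5},  # Text encoder LR
    ],
    weight_decay=0.01,
)
```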
At a very high level, this scheduler starts at the initial learning rate (1e-3 here, with a power of 1.0 by default) and lowers it as each epoch is completed. In the case of LoRA, this allows you to capture an extremely large amount of detail.
LoRA has been shown to capture pretty good detail at 1e-4, but it suffers when that rate is held constant. With the settings above, the learning rate starts at 1e-3 and ends at 1e-4 over the course of training. This gives you the best of both worlds when it comes to LoRA training.
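To make the decay concrete, here is a small self-contained sketch of a polynomial schedule with power 1.0 and no warmup, decaying from 1e-3 to 1e-4 over the 1500 steps used here. It mirrors common polynomial-decay-with-end-LR implementations (e.g. the one in Hugging Face transformers) rather than the WebUI's exact code.

```python
def polynomial_lr(step, total_steps=1500, lr_init=1e-3, lr_end=1e-4, power=1.0):
    """Polynomial decay from lr_init to lr_end; power=1.0 is a straight line."""
    pct_remaining = 1.0 - step / total_steps
    return (lr_init - lr_end) * pct_remaining**power + lr_end

for step in (0, 375, 750, 1125, 1500):
    print(f"step {step:4d}: lr = {polynomial_lr(step):.2e}")
# step    0: lr = 1.00e-03
# step  375: lr = 7.75e-04
# step  750: lr = 5.50e-04
# step 1125: lr = 3.25e-04
# step 1500: lr = 1.00e-04
```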
For the training data, I simply used 5 images of Kiriko from the internet, plus captions, to train on. No instance tokens were used. This was sparked by this idea on Reddit.
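For context, a caption-based dataset like this is just each image paired with a text file whose contents are used as the prompt via [filewords]. The filenames below are hypothetical and only meant to show the layout:

```
dataset/
├── kiriko_01.png
├── kiriko_01.txt   <- plain-text caption describing the image
├── kiriko_02.png
├── kiriko_02.txt
└── ...
```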
The thing that intrigues me most about this method is that it took me no time at all to create this dataset (10 minutes max). Being able to quickly prototype datasets in this manner, alongside the weight variables (Unet / Text Encoder), makes this an amazing addition to many workflows.
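Regarding the Unet / Text Encoder weight variables: conceptually they just scale how strongly each module's LoRA delta is applied on top of the base weights. Below is a generic sketch of that idea, not the WebUI's actual implementation, and it ignores the alpha/rank scaling some implementations add.

```python
import torch

def apply_lora_delta(base_weight, lora_down, lora_up, scale):
    """Merge a low-rank LoRA delta into a base weight, scaled by `scale`
    (e.g. 0.8 for the Unet and 0.9 for the text encoder in this experiment)."""
    return base_weight + scale * (lora_up @ lora_down)

# Toy example: a 768x768 base weight with a rank-4 LoRA delta.
base = torch.randn(768, 768)
lora_down = torch.randn(4, 768) * 0.01   # (rank, in_features)
lora_up = torch.randn(768, 4) * 0.01     # (out_features, rank)

unet_style_merge = apply_lora_delta(base, lora_down, lora_up, scale=0.8)
text_encoder_style_merge = apply_lora_delta(base, lora_down, lora_up, scale=0.9)
```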
I will most definitely experiment with this avenue more, and it would probably benefit from a larger training dataset or even from using regularization images. Would love to hear anybody's thoughts or additions on the matter!