Todos for this project #138
-
Thanks for the list. @cloneofsimo you are not lacking in creativity or ambition! Some quick thoughts.
Interesting idea. I wonder if it's somewhat related to the idea from this paper (https://arxiv.org/pdf/2206.06122.pdf), which trains only the singular values from the SVD of the linear weights. My intuition is that at full rank this is the same as LoRA (although they don't cite it), but at reduced rank it may be faster and smaller (assuming you recompute U and V on loading).
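If I'm reading the paper right, the trainable part is roughly this (a sketch, not their code; at full rank only min(out, in) singular values are trained):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDSingularValueLayer(nn.Module):
    """Freeze U and Vh from the SVD of a pretrained weight; train only the singular values."""

    def __init__(self, weight: torch.Tensor, rank=None):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        if rank is not None:  # reduced-rank variant: keep only the top-r components
            U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]
        self.register_buffer("U", U)
        self.register_buffer("Vh", Vh)
        self.s = nn.Parameter(S.clone())  # the only trainable parameters

    def forward(self, x):
        # Effective weight: W = U @ diag(s) @ Vh
        W = (self.U * self.s) @ self.Vh
        return F.linear(x, W)
```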
I'll finish my PR soon; there seem to be some benefits of this combined with an extra nonlinearity.
This would be cool. Seems like we could do this right away by just allowing the LoRALayers to check whether the down/up tensors have a third dimension and sum over it in the forward pass?
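Something like this, as a rough sketch (the `lora_down`/`lora_up` names and the stacking layout are assumptions, not the repo's exact API):

```python
import torch
import torch.nn as nn

class MultiLoRALinear(nn.Module):
    """Hypothetical LoRA linear that also accepts stacked (n, r, in) / (n, out, r)
    down/up tensors and sums the contribution of each LoRA in the stack."""

    def __init__(self, base: nn.Linear, lora_down: torch.Tensor, lora_up: torch.Tensor, scale: float = 1.0):
        super().__init__()
        self.base = base
        # Promote a single (r, in) / (out, r) pair to a stack of size one
        if lora_down.dim() == 2:
            lora_down, lora_up = lora_down.unsqueeze(0), lora_up.unsqueeze(0)
        self.lora_down = nn.Parameter(lora_down)
        self.lora_up = nn.Parameter(lora_up)
        self.scale = scale

    def forward(self, x):
        out = self.base(x)
        # Sum over the third (stack) dimension: add each LoRA's x @ down_i.T @ up_i.T
        for down, up in zip(self.lora_down, self.lora_up):
            out = out + self.scale * (x @ down.t() @ up.t())
        return out
```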
IMO, some of these items should go into another repo (or several, given the length of the list!). I think the augmentation direction is potentially huge (combined with maybe some semantic segmentation and/or CLIPSeg). Importance sampling, etc...
Agreed, I think any of the continual learning methods could be applied here. I for one would like to keep adding things to a model, rather than retraining the same base model each time. Time to replace that sledgehammer in Dreambooth (prior class regularization).
This seems like such a priority area. Without good metrics, everyone is training in the dark. Even non-ideal metrics seem better than asking someone on Reddit how many steps per image to train for. I find it wild that there isn't more work on this. Tangentially related is the issue of loss functions. I think everyone uses L2 loss for simplicity, but I'm finding that L1 loss is sometimes better for capturing face details. Seems like understanding ideal loss functions (in latent space) would be interesting. This intersects with continual learning in the case of some regularizations (e.g., Elastic Weight Consolidation).
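For reference, swapping the objective is a one-line change in a typical latent-diffusion training loop (a sketch with placeholder tensors, not the repo's actual training code):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the UNet's noise prediction and the sampled noise target (latent-space shapes)
noise_pred = torch.randn(4, 4, 64, 64)
noise = torch.randn(4, 4, 64, 64)

loss_l2 = F.mse_loss(noise_pred, noise)                      # the usual choice
loss_l1 = F.l1_loss(noise_pred, noise)                       # sometimes sharper on fine (face) detail
loss_huber = F.smooth_l1_loss(noise_pred, noise, beta=0.1)   # a middle ground between the two
```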
I found that there is a specific problem with enabling gradient accumulation, but only for the text_encoder. Enabling it only for the unet seems to work fine. I'll leave a comment on the relevant issue when I can find my notes. Two other thoughts for speed:
I ran a first pass on training only the up_blocks of the Unet, and it does extremely well, even at the same rank (compared to down+mid+up blocks). I'll make some image grids this week for discussion, just need to try it with the conv2d layers as well.
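One way to run that kind of ablation (a sketch; the `lora` substring and `up_blocks` prefix in parameter names are assumptions about naming, not this repo's exact convention):

```python
import torch
import torch.nn as nn

def select_up_block_lora_params(unet: nn.Module):
    """Freeze everything except LoRA parameters living under up_blocks.*"""
    params = []
    for name, param in unet.named_parameters():
        keep = "lora" in name and name.startswith("up_blocks")
        param.requires_grad_(keep)
        if keep:
            params.append(param)
    return params

# optimizer = torch.optim.AdamW(select_up_block_lora_params(unet), lr=1e-4)
```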
I'm working on a repo for exactly this (hoping to have an alpha pre-release out this week, but any feedback on the WIP branch is welcome!). It lets you add LoRA wherever you want (or not) throughout the model (you can make any combination of the models here and much more). I'm building a community pipeline to submit to diffusers so we can get more people on board with LoRA experimentation (and of course I'll make it compatible with how the weights are saved in @cloneofsimo's repo).
-
Also have a look at this recent stable-diffusion pull request, which uses a depth or CLIPSeg mask to weight the loss for faster training: "This will add the ability to train Embeddings and Hypernetworks using a weighted loss."
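The weighted-loss idea amounts to roughly this (a sketch with made-up shapes; in practice the mask would come from a depth model or CLIPSeg and be downsampled to latent resolution):

```python
import torch
import torch.nn.functional as F

noise_pred = torch.randn(1, 4, 64, 64)   # UNet output in latent space
noise = torch.randn(1, 4, 64, 64)        # noise target
mask = torch.rand(1, 1, 512, 512)        # e.g. a CLIPSeg/depth mask in [0, 1]

# Downsample the mask to latent resolution and weight the per-pixel loss with it
mask_latent = F.interpolate(mask, size=noise_pred.shape[-2:], mode="bilinear", align_corners=False)
per_pixel = F.mse_loss(noise_pred, noise, reduction="none")
loss = (per_pixel * mask_latent).mean()
```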
-
^ It's a +1 for me
-
One confusing thing for people is that there are multiple LoRA file formats. The one from kohya is the easiest to use: via the extension you can load many LoRA models without having to merge them into a model first. If this project could adopt the same LoRA file format, it would make the weights much easier to use... right now it is cumbersome to have to merge a LoRA into a model to be able to make use of it... unless I missed something.
-
To be honest, I am fascinated by how much people have enjoyed this project and are using it to have fun. I've been contacted by many startups and companies, and surprisingly many of them have deployed this project in their pipelines.
Unfortunately, I won't be able to work on this project forever, because I am starting a master's degree in March. Until then, I have some plans for the future... Some of these could potentially be paper-worthy, and I would definitely be interested in collaborating if someone reading this has a plan.
Research side of things
Modelling
For two LoRAs $\Delta W_1 = A_1 B_1$ and $\Delta W_2 = A_2 B_2$ (each of rank $r$), the merging operation would make

$$\Delta W = \begin{pmatrix} A_1 & A_2 \end{pmatrix} \Sigma \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$$

where $\Sigma = I_{2r}$. This is now rank $2r$, and has the capability to become either one. Notice that, with

$$\Sigma = \begin{pmatrix} \alpha I_r & 0 \\ 0 & \beta I_r \end{pmatrix},$$

we have:

$$\Delta W = \alpha A_1 B_1 + \beta A_2 B_2.$$
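A sketch of that concatenation merge for a single linear layer (the tensor layout is an assumption; the point is just the block structure above):

```python
import torch

def concat_merge(up1, down1, up2, down2, alpha=1.0, beta=1.0):
    """Merge two LoRAs without losing information by stacking them into a rank-2r pair.
    up_i: (out_features, r), down_i: (r, in_features).
    alpha = beta = 1 corresponds to Sigma = I_{2r}; other values give a weighted blend."""
    up = torch.cat([alpha * up1, beta * up2], dim=1)   # (out, 2r)
    down = torch.cat([down1, down2], dim=0)            # (2r, in)
    return up, down

# Sanity check: the merged pair's delta equals alpha * A1 @ B1 + beta * A2 @ B2
out_f, in_f, r = 8, 6, 2
A1, B1 = torch.randn(out_f, r), torch.randn(r, in_f)
A2, B2 = torch.randn(out_f, r), torch.randn(r, in_f)
up, down = concat_merge(A1, B1, A2, B2, alpha=0.7, beta=0.3)
assert torch.allclose(up @ down, 0.7 * A1 @ B1 + 0.3 * A2 @ B2, atol=1e-5)
```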
Dataset
(independent of LoRA. Should I make another repo for this?)
Basic stuff: a preprocessor that handles SR, BLIP, and CLIPSeg for auto-captioning. Basic dataset pipelines #139. (In general, lots of continual-learning-based methods seem to be unexplored for fine-tuning with high editability.)
Distillation
SVD distillation that distills resnets + other bias terms as well. Bias terms should have small weights. SVD update with conv support + LoRA add, weight update following recent updates #140
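Roughly, the extraction step would look like this (a sketch for 2-D linear weights only; conv kernels would need flattening first, and bias deltas can be carried over directly since they are small):

```python
import torch

@torch.no_grad()
def svd_distill(w_base: torch.Tensor, w_tuned: torch.Tensor, rank: int):
    """Approximate the weight delta with a rank-r factorization via truncated SVD.
    Returns (up, down) such that up @ down ~= w_tuned - w_base."""
    delta = (w_tuned - w_base).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    up = U[:, :rank] * S[:rank]   # fold the singular values into the up matrix
    down = Vh[:rank, :]
    return up, down
```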
Metrics
(independent of LoRA.)
Image alignment (CLIP image score) is an extremely bad proxy for what we actually want. There should be a better metric, but there is no research on this topic.
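For reference, the metric in question is basically mean cosine similarity between CLIP image embeddings of the generations and the training images, e.g. (a sketch using the Hugging Face transformers CLIP; the model choice is arbitrary):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_image_alignment(generated_images, reference_images):
    """Mean cosine similarity between CLIP embeddings of generated and reference PIL images."""
    gen = processor(images=generated_images, return_tensors="pt")
    ref = processor(images=reference_images, return_tensors="pt")
    e_gen = model.get_image_features(**gen)
    e_ref = model.get_image_features(**ref)
    e_gen = e_gen / e_gen.norm(dim=-1, keepdim=True)
    e_ref = e_ref / e_ref.norm(dim=-1, keepdim=True)
    return (e_gen @ e_ref.T).mean().item()
```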
Other tasks
Engineering side of things
Memory optimization
Speed optimization
Approachability (Make it simpler for non-developers)
Up-to-date merging operations. SVD update with conv support + LoRA add, weight update following recent updates #140
Inference optimization
(This looks like a huge amount of work, so definitely not a priority.)