-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add nonlinearity between Down and Up weights, expose alpha hyperparameter #111
base: develop
Are you sure you want to change the base?
Conversation
Note that LoRA paper defines scale = alpha/r
Nice work as always!! |
Keep tune_lora_scale as previously to minimize breakage.
Ooooo so alpha = 4 is the default? This is going to make my head hurt 🫠 |
What if we set alpha = 1 as default, and scale up lora instead? |
So I set alpha=4 as default so that everything should produce scale=1.0 by default (which is what you had before). Hence my last addition of a 'tune_lora_alpha' and leaving 'tune_lora_scale' alone. This should be transparent to most users, as they won't see alpha unless they want to... I agree that alpha and scale are redundant, and I honestly have no idea why it was defined that way in the paper. |
Hmm I want to talk about this bit more deeper. Its just that I"m not really understanding why we need to introduce scale term here. I think that the previous implementation + new implementation on the initialization with scale would be enough. My reason is suppose we train a lora with scale = 16 But instead, since If we instead use the scale parameter to make A larger on the initialization (and use higher lr), end user would just know 1. Is the "default full weight", and would need to use somewhere 0.~1 to get a decent result, which is less confusing |
But its just my thought, and I think this will change a lot of code all at once, (with readme and some of your analysis as well) so im happy to discuss with more depth |
You raise great points that I completely agree with. Using both alpha and scale is not a good idea, and even the LoRA authors ignore this after defining it. So I think it's best to revert back to just using scale. Now the question is whether to expose scale for users. As you said, changing the scale can effectively be compensated by changing the learning rate or the the variance of the weight initialization. He et al. casually mention that changing scale can have an impact, and they actually hand-tune it. This may become more relevant if the pointwise nonlinearity is incorporated, since in that case, scale is no longer equivalent to just changing the variance of the weight initialization. So here's what I suggest:
|
I think whatever works fine if it is just backward compatible. Overall, I just want to keep the notion of "full training == |
Basically because I've built many test scripts with those notion, and lot of other implementations are on that idea as well, so I think it would surprise a lot of people if alpha is unknown argument, and you suddenly need to use large scale value to get it to work |
So maybe keeping alpha intact, setting default scale = 1, and exposing it the advanced users might work?
So save |
Yeah, that sounds good. |
Not strictly a LoRA change, but quite easy to fit into the current code. The idea is that squeezing an element-wise nonlinearity in between the Down and Up transformations can in some cases increase your performance at essentially no cost.
I also exposed the alpha parameter (keeping the convention from the original LoRA paper that scale=alpha/rank), which can also get you some extra performance for little effort.
Ideas from this fantastic paper:
He, Junxian, et al. "Towards a unified view of parameter-efficient transfer learning." arXiv preprint arXiv:2110.04366 (2021).
https://arxiv.org/pdf/2110.04366.pdf
Any thoughts before I start testing more thoroughly?