ENH Argument to enable bias for LoRA B #2237
Conversation
This PR adds the argument `lora_bias` which, if set to `True` (default: `False`), adds a bias term to the LoRA B module.

Typically, this should be disabled. The main use case is when the LoRA weights were extracted from fully fine-tuned parameters, so the bias of those parameters can be taken into account.

Merging is supported for this argument when using vanilla LoRA layers or bitsandbytes LoRA layers. Other types of LoRA layers don't support merging.

This option is also disabled for non-standard LoRA weight initialization like LoftQ, as well as for embedding layers (since they use `nn.Parameter`, i.e. there is no bias term).
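For illustration, a minimal usage sketch of the new flag (the base model and target module names are placeholders, not part of this PR):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; any model supported by PEFT works the same way.
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

config = LoraConfig(
    r=8,
    target_modules=["q_proj", "v_proj"],  # placeholder target modules
    lora_bias=True,  # new in this PR: adds a bias term to the LoRA B module
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()
```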
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for getting this ready so quickly!
My major comments are:
- Should we consider naming the argument `lora_B_bias` as opposed to `lora_bias`, to be more informative at the expense of two extra characters?
- There seems to be a logical error when updating the bias value (as indicated in the comments).
- I didn't notice if we're raising any errors when unsupported configurations (such as the ones described in the OP) are detected. Maybe we should check that (if not already) and test?
@@ -35,13 +35,27 @@ def __init__(
    lora_dropout: float = 0.0,
    init_lora_weights: bool = True,
    use_rslora: bool = False,
    use_dora: bool = False,
    lora_bias: bool = False,
Should this be `lora_B_bias`? I find that to be a bit more informative.
Indeed, I considered this. My main reasoning for going with the more generic `lora_bias` was that it leaves the door open for extending this argument in the future. Say someone finds that LoRA works much better when also adding a bias to LoRA A; then we can adapt this argument to allow that too. Otherwise, we'd have to add a new argument (and we don't want to rename arguments for obvious reasons). LMK what you think of that reasoning.
> Otherwise, we'd have to add a new argument (and we don't want to rename arguments for obvious reasons).

I think that would still be preferable to having a single argument controlling the bias setup for LoRA, since this is still in its infancy.

Later, if it becomes a common standard to add biases for both LoRA matrices, we can deprecate `lora_B_bias` and `lora_A_bias` (if we introduce such an argument) in favor of a single argument called `lora_bias`.

This is where I stand, but I am not too opinionated about it.
Do we care about reproducibility after upgrading PEFT? If so, it seems detrimental to possibly merge control of the A and B biases into one flag in the future, and they should be separated into two flags from the start.

Otherwise, in terms of opportunity cost for experimentation on the user's side, I think having two separate parameters (`lora_bias_A`, `lora_bias_B`) is better. That said, having only one parameter appears simpler: let the implementation decide what the current best practice is for adding biases. So for someone who just wants to follow current LoRA best practice, a single flag is helpful. This becomes harder with two flags, since there is no obvious 'no bias at all' vs. 'best practice' setting. If we put simplicity first (and don't care about reproducibility after upgrading), then one parameter is the way to go, I think. What's the stance here?

Ideally there would be another, more low-level layer of abstraction with two bias parameters, and one above it that decides what the current best choice is, i.e. `BaseLoRA(..., lora_bias_A, lora_bias_B)` -> `LoRA(..., lora_bias)`.
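A hypothetical sketch of that layering, just to make the idea concrete (neither class nor the `lora_bias_A`/`lora_bias_B` parameters exist in PEFT):

```python
# Hypothetical sketch of the layering idea above; not actual PEFT code.
class BaseLoRA:
    """Low-level: one explicit bias switch per LoRA matrix."""

    def __init__(self, lora_bias_A: bool = False, lora_bias_B: bool = False):
        self.lora_bias_A = lora_bias_A
        self.lora_bias_B = lora_bias_B


class LoRA(BaseLoRA):
    """High-level: a single flag that maps to whatever is considered best practice."""

    def __init__(self, lora_bias: bool = False):
        # As of this PR, "best practice" would translate to a bias on LoRA B only.
        super().__init__(lora_bias_A=False, lora_bias_B=lora_bias)
```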
To clarify, my idea is that if we later want to add the possibility of a bias for LoRA A, the option would be something like `lora_bias="a"`, or for both, `lora_bias="both"`. We should not change the meaning of `lora_bias=True`, in order to ensure reproducibility, as you mentioned.

If we find that the parameter gets overloaded, we can add the option for a sub-config, e.g. `LoraConfig(..., lora_bias=LoraBiasConfig(bias_a=True, bias_b=True, ...))`.
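To make the sub-config idea concrete, a hypothetical sketch (neither `LoraBiasConfig` nor the string values exist in PEFT; this PR only adds the boolean flag):

```python
from dataclasses import dataclass

# Hypothetical future shape, not part of this PR.
@dataclass
class LoraBiasConfig:
    bias_a: bool = False
    bias_b: bool = True

# Today (this PR):     LoraConfig(..., lora_bias=True)    -> bias on LoRA B
# Possible extension:  LoraConfig(..., lora_bias="a")     -> bias on LoRA A
#                      LoraConfig(..., lora_bias="both")  -> bias on A and B
# Possible sub-config: LoraConfig(..., lora_bias=LoraBiasConfig(bias_a=True, bias_b=True))
```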
Seems like `lora_bias` should be fine for now.
Thanks for the feedback, I merged the PR as is.
atol = 0.01
rtol = 10
assert not torch.allclose(out_base, out_before_merge, atol=atol, rtol=rtol)
assert torch.allclose(out_before_merge, out_after_merge, atol=atol, rtol=rtol)
Could this be the reason why we need such a high tolerance?
No, we already had the high tolerance before:

Lines 848 to 849 in d13d7a4:

atol = 0.01
rtol = 10

I tried fiddling with these values to find something where the test would pass with the correct merging implementation and fail with the LoRA bias merging commented out, but I couldn't find values that fit. Maybe there is a very narrow range that would work, but if the tolerance is too narrow, the test could become unreliable.
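For context, here is a minimal sketch of the merge check under discussion (the toy model, target module name, and tolerances are placeholders; this is not the actual test from the PEFT test suite):

```python
import torch
from torch import nn
from peft import LoraConfig, get_peft_model

# Toy base model; the real test uses models from the PEFT test suite.
base_model = nn.Sequential(nn.Linear(16, 16))
x = torch.randn(4, 16)
out_base = base_model(x)

config = LoraConfig(
    target_modules=["0"],     # placeholder target module
    init_lora_weights=False,  # random init so the LoRA output differs from the base
    lora_bias=True,
)
model = get_peft_model(base_model, config)
out_before_merge = model(x)

model.merge_adapter()
out_after_merge = model(x)

atol, rtol = 0.01, 10
# LoRA with random init should change the output ...
assert not torch.allclose(out_base, out_before_merge, atol=atol, rtol=rtol)
# ... but merging should not change it (within the tolerance discussed above).
assert torch.allclose(out_before_merge, out_after_merge, atol=atol, rtol=rtol)
```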
Thanks for the review.
> I didn't notice if we're raising any errors when unsupported configurations (such as the ones described in the OP) are detected. Maybe we should check that (if not already) and test?

I added checks to the quantized layers that support merging, as it's only relevant there, so there is not a check for each one (a rough sketch follows the links below):
- https://github.com/huggingface/peft/pull/2237/files#diff-86aac165111c8b8eae9b68ff07210aa69d2136a6e68c2002ac70de8050132063R47-R48
- https://github.com/huggingface/peft/pull/2237/files#diff-97c2ced1de511b00d44fa7538aa83b1c9baba12a1f30e21389b4fad8a0a32cf0R35-R36
- https://github.com/huggingface/peft/pull/2237/files#diff-5e62802c7cd4e536708b9ad1b5bfe866d307630079d95d234c2e5cd022aabd9eR56-R57
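Roughly, the guard pattern looks like this (a hedged sketch with a hypothetical class name; the exact placement and message are in the linked diffs, not here):

```python
# Hypothetical sketch of the guard pattern; not the exact lines linked above.
class SomeQuantizedLoraLayer:
    def __init__(self, lora_bias: bool = False):
        if lora_bias:
            raise ValueError(
                f"lora_bias=True is not supported for {self.__class__.__name__}, set it to False."
            )
        self.lora_bias = lora_bias
```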
As for incompatible configs, I tested them here:
Does this answer your question?
Thanks! This looks good to me, and thanks once again for getting this up so quickly!