
Step 2: when running "/get_train_lora_grads.sh", an error occurs while loading optimizer.pt #4

Open
victorjiax opened this issue Feb 23, 2024 · 20 comments


@victorjiax

When loading optimizer.pt, the keys do not match:
KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'

The keys in the optimizer.pt state are the integers 0~255.

@whi497

whi497 commented Feb 29, 2024

Same error, have you solved it?

@xiamengzhou
Collaborator

Hi, what transformers version are you using? I updated the requirements file to specify transformers==4.36.2.

@JPegah

JPegah commented Mar 17, 2024

I am getting the same error despite using the same transformers version!

@leopoldwhite
Contributor

Hi, what transformers version are you using? I updated the requirements file to specify transformers==4.36.2.

Same error using transformers==4.36.2.

@xiamengzhou
Collaborator

Hi, I realized that you have to use FSDP to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without FSDP, you will get optimizer.bin, which contains index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.

I am sure there is a workaround to recover key-value based optimization states from index-value based ones, and one can probably reuse functions from optimizer.state_dict() in Hugging Face.
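For example, a minimal sketch (the checkpoint path and the exact layout of the saved dict are assumptions here) of how one might check which of the two formats a saved optimizer file uses:

import torch

# Sketch only: the path is a placeholder, and the file may either be a full
# optimizer state_dict (with "state"/"param_groups") or just the per-parameter
# state mapping, depending on how it was saved.
state = torch.load("checkpoint-105/optimizer.pt", map_location="cpu")
param_states = state.get("state", state) if isinstance(state, dict) else state

keys = list(param_states.keys())
if keys and all(isinstance(k, int) for k in keys):
    print(f"index-keyed states (0..{max(keys)}); needs remapping to parameter names")
else:
    print("name-keyed states, e.g.:", keys[:3])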

@GCYZSL

GCYZSL commented Mar 29, 2024

Hi, thank you for your solution. I added the arguments, and there is a new error:

RuntimeError: Cannot writeback when the parameter shape changes
Expects torch.Size([131076096]) but got torch.Size([32001, 4096])

@xiamengzhou
Collaborator

It seems to be a parameter-flattening issue. Could you provide the script and code you ran?

@GCYZSL

GCYZSL commented Apr 1, 2024

Thank you for your response! I ran warmup_lora_train.sh. It ran well before adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune. I added the arguments in warmup_lora_train.sh as follows:

training_args="$base_training_args \
--model_name_or_path $model_path \
--output_dir $output_dir \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"

@RrankPyramid

Hi, I realized that you have to use FSDP to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without FSDP, you will get optimizer.bin, which contains index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.

I am sure there is a workaround to recover key-value based optimization states from index-value based ones, and one can probably reuse functions from optimizer.state_dict() in Hugging Face.

@xiamengzhou Hi, I got the same error (KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight') even though I am loading optimizer.pt instead of optimizer.bin. Is there a way to solve this?

@Tantor-D

Tantor-D commented Apr 11, 2024

Hi, I realized that you have to use FSDP to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without FSDP, you will get optimizer.bin, which contains index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.
I am sure there is a workaround to recover key-value based optimization states from index-value based ones, and one can probably reuse functions from optimizer.state_dict() in Hugging Face.

@xiamengzhou Hi, I got the same error (KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight') even though I am loading optimizer.pt instead of optimizer.bin. Is there a way to solve this?

I encountered the same error. When I ran it without the --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune settings, I got optimizer.pt. Then, after changing optimizer.bin to optimizer.pt in get_info.py, I hit a KeyError for 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'. Has anyone found a solution to this issue?

@xiamengzhou
Collaborator

@Tantor-D @RrankPyramid Could you check what the keys are like in your optimizer.pt file?

@Tantor-D

@xiamengzhou
Thank you for your reply! It seems I've identified the issue: the keys in the adam_optimizer_state dictionary appear as

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255])

However, the names list retrieved in the prepare_optimizer_state function of collect_grad_reps.py shows different information, indicating that the saved optimizer.pt may not be correctly storing key-value-based optimization states.

The names list appears as:

['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 
'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 
'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 
'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 
'base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight',

I will add --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to warmup_lora_train.sh and run again. Thanks again for your reply.

@tengerye

tengerye commented May 4, 2024

Hi @Tantor-D, have you found a solution yet?

After I added --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune, I got a new error:

05/04/2024 06:43:26 - WARNING - accelerate.accelerator - FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
  warnings.warn(
Traceback (most recent call last):
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/tye/workspace/less_influence/LESS/less/train/train.py", line 181, in <module>
    main()
  File "/data/tye/workspace/less_influence/LESS/less/train/train.py", line 161, in main
    train_result = trainer.train()
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1270, in prepare
    result = tuple(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1271, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1083, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1429, in prepare_model
    model = FSDP(model, **kwargs)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 391, in __init__
    _auto_wrap(auto_wrap_kwargs, fsdp_kwargs, FullyShardedDataParallel)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 73, in _auto_wrap
    _recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 2 more times]
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
    _init_param_handle_from_module(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 429, in _init_param_handle_from_module
    _init_param_handle_from_params(state, managed_params, fully_sharded_module)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 525, in _init_param_handle_from_params
    handle = FlatParamHandle(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 366, in __init__
    self._init_flat_param(params, fully_sharded_module, use_orig_params)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 440, in _init_flat_param
    raise ValueError(
ValueError: `FlatParameter` requires uniform `requires_grad`

@Tantor-D

Tantor-D commented May 4, 2024

@tengerye I solved the error by adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to less/scripts/train/warmup_lora_train.sh. The code now works well.

Here is the changed version.

training_args="$base_training_args \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--model_name_or_path $model_path \
--output_dir $output_dir \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"

@tengerye

tengerye commented May 4, 2024

@Tantor-D Thank you so much for your kind reply. My problem came from wrong package versions in my environment, and it has been solved.

@shangqing-liu

Hi @xiamengzhou, I have another question about the code. After testing it, I found that two rounds of warmup training are needed: first I disable --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to finish one round of training and get optimizer1.bin, and then I enable --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune for another training run to get optimizer2.bin. After that, I have to replace optimizer1.bin with optimizer2.bin because of the key problem (KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight').

Hence, may I ask how to merge both and get the warmup model with a single round of training?

Thanks.

@shangqing-liu

The problem has been solved. Thanks

@mihara-bot

@tengerye I solved the error by adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to less/scripts/train/warmup_lora_train.sh. The code now works well.

Here is the changed version.

training_args="$base_training_args \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--model_name_or_path $model_path \
--output_dir $output_dir \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"

Hi, with this change Step 1 runs smoothly, but at Step 2 I ran into the "optimizer.bin not found" problem.
#18
Would you please kindly help me with it?
Best regards

@Yupei-Du

Hi, I realized that you have to use FSDP to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without FSDP, you will get optimizer.bin, which contains index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.

I am sure there is a workaround to recover key-value based optimization states from index-value based ones, and one can probably reuse functions from optimizer.state_dict() in Hugging Face.

I have a very basic workaround for the index-value-based file; there are probably bugs, but so far it seems to work:

import torch
from transformers.optimization import AdamW
from transformers.trainer_pt_utils import get_parameter_names
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS

def load_adam_state(model, optimizer_state_path):
    # Rebuild the two parameter groups (decay / no-decay) in the same order the
    # Trainer used, so the saved index-based state lines up with them.
    opt_grouped_parameters = [{'weight_decay': 0.0}, {'weight_decay': 0.0}]
    opt_grouped_parameter_names = [None, None]

    decay_parameters = [name for name in get_parameter_names(model, ALL_LAYERNORM_LAYERS) if 'bias' not in name]
    opt_grouped_parameters[0]['params'], opt_grouped_parameter_names[0] = zip(*[
        (p, n) for n, p in model.named_parameters() if n in decay_parameters and p.requires_grad])
    param_name_to_size_dict = {n: p.size() for n, p in model.named_parameters() if p.requires_grad}
    if len(param_name_to_size_dict) != len(opt_grouped_parameter_names[0]):
        opt_grouped_parameters[1]['params'], opt_grouped_parameter_names[1] = zip(*[
            (p, n) for n, p in model.named_parameters() if n not in decay_parameters and p.requires_grad])
    else:
        opt_grouped_parameters[1]['params'], opt_grouped_parameter_names[1] = [], []

    # Load the saved (index-keyed) optimizer state into a freshly built AdamW.
    optimizer = AdamW(opt_grouped_parameters)
    optimizer.load_state_dict(torch.load(optimizer_state_path, map_location='cpu'))
    saved_state_dict = optimizer.state_dict()

    # Map each integer parameter index back to its parameter name.
    param_name_to_saved_state_dict = {}
    for group_idx in range(len(saved_state_dict['param_groups'])):
        group_param_indices = saved_state_dict['param_groups'][group_idx]['params']
        group_param_names = opt_grouped_parameter_names[group_idx]
        for param_idx, param_name in zip(group_param_indices, group_param_names):
            param_size = param_name_to_size_dict[param_name]
            exp_avg = saved_state_dict['state'][param_idx]['exp_avg']
            exp_avg_sq = saved_state_dict['state'][param_idx]['exp_avg_sq']
            assert exp_avg.size() == param_size
            param_name_to_saved_state_dict[param_name] = {'exp_avg': exp_avg, 'exp_avg_sq': exp_avg_sq}

    return param_name_to_saved_state_dict
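A possible usage sketch (peft_model and the checkpoint path are placeholders, not names from this repo):

# "peft_model" is assumed to be the same LoRA-wrapped model used for warmup
# training; the path points at the saved index-keyed optimizer state.
adam_state = load_adam_state(peft_model, "checkpoint-105/optimizer.bin")
print(list(adam_state.keys())[:2])  # should now print parameter names, not integers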

@amy-77

amy-77 commented Nov 28, 2024

When I add FSDP to the training script:

training_args="$base_training_args \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--model_name_or_path $model_path \
--output_dir $output_dir \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"
...
The new error is:

raise EnvironmentError(
OSError: /hpc2hdd/home/bli303/dc/out/llama2-7b-p0.05-lora-seed3_fsdp/checkpoint-105 does not appear to have a file named config.json. Checkout 'https://huggingface.co//hpc2hdd/home/bli303/dc/out/llama2-7b-p0.05-lora-seed3_fsdp/checkpoint-105/None' for available files.

No config.json has been saved; the checkpoint only contains:

(envdc) bli303@9b72496437af:~/dc/out/llama2-7b-p0.05-lora-seed3_fsdp/checkpoint-105$ ls
optimizer.bin rng_state_1.pth rng_state_4.pth rng_state_7.pth
pytorch_model.bin rng_state_2.pth rng_state_5.pth scheduler.pt
rng_state_0.pth rng_state_3.pth rng_state_6.pth trainer_state.json
