
Step 2: when running "/get_train_lora_grads.sh", an error occurs while loading optimizer.pt #4

Open
victorjiax opened this issue Feb 23, 2024 · 20 comments


@victorjiax

When loading optimizer.pt, the keys do not match:
KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'

The keys in the optimizer.pt state are the integers 0~255.

@whi497

whi497 commented Feb 29, 2024

Same error, have you solved it?

@xiamengzhou
Collaborator

Hi, what transformers version are you using? I updated the requirements file to specify transformers==4.36.2.

@JPegah

JPegah commented Mar 17, 2024

I am getting the same error despite using the same transformers version!

@leopoldwhite
Contributor

Hi, what transformers version are you using? I updated the requirements file to specify transformers==4.36.2.

Same error using transformers==4.36.2.

@xiamengzhou
Collaborator

Hi, I realized that you have to use FSDP to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without FSDP, you will get optimizer.bin, which contains index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.

I am sure there is a workaround to recover key-value based optimization states from index-value based ones, and one can probably reuse functions from optimizer.state_dict() in Hugging Face.
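For example, a minimal sketch (the checkpoint path and the exact layout of the saved dict are assumptions here) of how one might check which of the two formats a saved optimizer file uses:

import torch

# Sketch only: the path is a placeholder, and the file may either be a full
# optimizer state_dict (with "state"/"param_groups") or just the per-parameter
# state mapping, depending on how it was saved.
state = torch.load("checkpoint-105/optimizer.pt", map_location="cpu")
param_states = state.get("state", state) if isinstance(state, dict) else state

keys = list(param_states.keys())
if keys and all(isinstance(k, int) for k in keys):
    print(f"index-keyed states (0..{max(keys)}); needs remapping to parameter names")
else:
    print("name-keyed states, e.g.:", keys[:3])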

@GCYZSL

GCYZSL commented Mar 29, 2024

Hi, thank you for your solution. I added the arguments, and there is a new error:

RuntimeError: Cannot writeback when the parameter shape changes
Expects torch.Size([131076096]) but got torch.Size([32001, 4096])

@xiamengzhou
Collaborator

It seems to be a parameter-flattening issue. Could you provide the script and code you ran?

@GCYZSL

GCYZSL commented Apr 1, 2024

Thank you for your response! I ran warmup_lora_train.sh. It ran well before adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune. I added the arguments in warmup_lora_train.sh as follows:

training_args="$base_training_args \
--model_name_or_path $model_path \
--output_dir $output_dir \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"

@RrankPyramid

Hi, I realized that you have to use FSDP to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without FSDP, you will get optimizer.bin, which contains index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.

I am sure there is a workaround to recover key-value based optimization states from index-value based ones, and one can probably reuse functions from optimizer.state_dict() in Hugging Face.

@xiamengzhou Hi, I got the same error (KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight') even though I am loading optimizer.pt instead of optimizer.bin. Is there a way to solve this?

@Tantor-D

Tantor-D commented Apr 11, 2024

Hi, I realized that you have to use FSDP to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without FSDP, you will get optimizer.bin, which contains index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.
I am sure there is a workaround to recover key-value based optimization states from index-value based ones, and one can probably reuse functions from optimizer.state_dict() in Hugging Face.

@xiamengzhou Hi, I got the same error (KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight') even though I am loading optimizer.pt instead of optimizer.bin. Is there a way to solve this?

I encountered the same error. When I ran it without the --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune settings, I got optimizer.pt. Then, after changing optimizer.bin to optimizer.pt in get_info.py, I hit a KeyError for 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'. Has anyone found a solution to this issue?

@xiamengzhou
Collaborator

@Tantor-D @RrankPyramid Could you check what the keys are like in your optimizer.pt file?

@Tantor-D

@xiamengzhou
Thank you for your reply! It seems I've identified the issue: the keys in the adam_optimizer_state dictionary appear as

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255])

However, the names list retrieved in the prepare_optimizer_state function of collect_grad_reps.py shows different information, indicating that the saved optimizer.pt may not be correctly storing key-value-based optimization states.

The names list appears as:

['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 
'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 
'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 
'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 
'base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight',

I will add --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to warmup_lora_train.sh and run again. Thanks again for your reply.

@tengerye

tengerye commented May 4, 2024

Hi @Tantor-D, have you found a solution yet?

After I added --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune, I got a new error:

05/04/2024 06:43:26 - WARNING - accelerate.accelerator - FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
  warnings.warn(
Traceback (most recent call last):
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/tye/workspace/less_influence/LESS/less/train/train.py", line 181, in <module>
    main()
  File "/data/tye/workspace/less_influence/LESS/less/train/train.py", line 161, in main
    train_result = trainer.train()
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1270, in prepare
    result = tuple(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1271, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1083, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1429, in prepare_model
    model = FSDP(model, **kwargs)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 391, in __init__
    _auto_wrap(auto_wrap_kwargs, fsdp_kwargs, FullyShardedDataParallel)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 73, in _auto_wrap
    _recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 2 more times]
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
    _init_param_handle_from_module(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 429, in _init_param_handle_from_module
    _init_param_handle_from_params(state, managed_params, fully_sharded_module)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 525, in _init_param_handle_from_params
    handle = FlatParamHandle(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 366, in __init__
    self._init_flat_param(params, fully_sharded_module, use_orig_params)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 440, in _init_flat_param
    raise ValueError(
ValueError: `FlatParameter` requires uniform `requires_grad`

@Tantor-D

Tantor-D commented May 4, 2024

@tengerye I solved the error by adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to less/scripts/train/warmup_lora_train.sh. The code now works well.

Here is the changed version.

training_args="$base_training_args \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--model_name_or_path $model_path \
--output_dir $output_dir \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"

@tengerye

tengerye commented May 4, 2024

@Tantor-D Thank you so much for your kind reply. My problem came from wrong package versions in my environment, and it has been solved.

@shangqing-liu

Hi @xiamengzhou, I have another question about the code. After testing it, I found that two rounds of warmup training are needed: first I disable --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to finish one round of training and get optimizer1.bin, and then I enable --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune for another training run to get optimizer2.bin. After that, I have to replace optimizer1.bin with optimizer2.bin because of the key problem (KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight').

Hence, may I ask how to merge both and get the warmup model with a single round of training?

Thanks.

@shangqing-liu

The problem has been solved. Thanks

@mihara-bot

@tengerye I solved the error by adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to less/scripts/train/warmup_lora_train.sh. The code now works well.

Here is the changed version.

training_args="$base_training_args \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--model_name_or_path $model_path \
--output_dir $output_dir \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"

Hi, with this change Step 1 runs smoothly, but at Step 2 I ran into the "optimizer.bin not found" problem.
#18
Would you please kindly help me with it?
Best regards

@Yupei-Du

Hi, I realized that you have to use FSDP to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without FSDP, you will get optimizer.bin, which contains index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.

I am sure there is a workaround to recover key-value based optimization states from index-value based ones, and one can probably reuse functions from optimizer.state_dict() in Hugging Face.

I have a very basic workaround for the index-value-based file; there are probably bugs, but so far it seems to work:

import torch
from transformers.optimization import AdamW
from transformers.trainer_pt_utils import get_parameter_names
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS

def load_adam_state(model, optimizer_state_path):
    # Rebuild the two parameter groups (decay / no-decay) in the same order the
    # Trainer used, so the saved index-based state lines up with them.
    opt_grouped_parameters = [{'weight_decay': 0.0}, {'weight_decay': 0.0}]
    opt_grouped_parameter_names = [None, None]

    decay_parameters = [name for name in get_parameter_names(model, ALL_LAYERNORM_LAYERS) if 'bias' not in name]
    opt_grouped_parameters[0]['params'], opt_grouped_parameter_names[0] = zip(*[
        (p, n) for n, p in model.named_parameters() if n in decay_parameters and p.requires_grad])
    param_name_to_size_dict = {n: p.size() for n, p in model.named_parameters() if p.requires_grad}
    if len(param_name_to_size_dict) != len(opt_grouped_parameter_names[0]):
        opt_grouped_parameters[1]['params'], opt_grouped_parameter_names[1] = zip(*[
            (p, n) for n, p in model.named_parameters() if n not in decay_parameters and p.requires_grad])
    else:
        opt_grouped_parameters[1]['params'], opt_grouped_parameter_names[1] = [], []

    # Load the saved (index-keyed) optimizer state into a freshly built AdamW.
    optimizer = AdamW(opt_grouped_parameters)
    optimizer.load_state_dict(torch.load(optimizer_state_path, map_location='cpu'))
    saved_state_dict = optimizer.state_dict()

    # Map each integer parameter index back to its parameter name.
    param_name_to_saved_state_dict = {}
    for group_idx in range(len(saved_state_dict['param_groups'])):
        group_param_indices = saved_state_dict['param_groups'][group_idx]['params']
        group_param_names = opt_grouped_parameter_names[group_idx]
        for param_idx, param_name in zip(group_param_indices, group_param_names):
            param_size = param_name_to_size_dict[param_name]
            exp_avg = saved_state_dict['state'][param_idx]['exp_avg']
            exp_avg_sq = saved_state_dict['state'][param_idx]['exp_avg_sq']
            assert exp_avg.size() == param_size
            param_name_to_saved_state_dict[param_name] = {'exp_avg': exp_avg, 'exp_avg_sq': exp_avg_sq}

    return param_name_to_saved_state_dict
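A possible usage sketch (peft_model and the checkpoint path are placeholders, not names from this repo):

# "peft_model" is assumed to be the same LoRA-wrapped model used for warmup
# training; the path points at the saved index-keyed optimizer state.
adam_state = load_adam_state(peft_model, "checkpoint-105/optimizer.bin")
print(list(adam_state.keys())[:2])  # should now print parameter names, not integers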

@amy-77

amy-77 commented Nov 28, 2024

When I add FSDP to the training script:

training_args="$base_training_args \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--model_name_or_path $model_path \
--output_dir $output_dir \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"
...
The new error is:

raise EnvironmentError(
OSError: /hpc2hdd/home/bli303/dc/out/llama2-7b-p0.05-lora-seed3_fsdp/checkpoint-105 does not appear to have a file named config.json. Checkout 'https://huggingface.co//hpc2hdd/home/bli303/dc/out/llama2-7b-p0.05-lora-seed3_fsdp/checkpoint-105/None' for available files.

No config.json has been saved; the checkpoint only contains:

(envdc) bli303@9b72496437af:~/dc/out/llama2-7b-p0.05-lora-seed3_fsdp/checkpoint-105$ ls
optimizer.bin rng_state_1.pth rng_state_4.pth rng_state_7.pth
pytorch_model.bin rng_state_2.pth rng_state_5.pth scheduler.pt
rng_state_0.pth rng_state_3.pth rng_state_6.pth trainer_state.json
