
Intermittent failures when launching multiple model replicas on a single multi-GPU machine #543

Open
1 of 2 tasks
UbeCc opened this issue Sep 4, 2024 · 0 comments
Comments

UbeCc commented Sep 4, 2024

System Info

transformers==4.42.3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

Launch multiple glm4 instances simultaneously:

CUDA_LIST=(0 1 2 3 4 5 6 7)
export PORTS=(18008 18009 18010 18011 18012 18013 18014 18015)
CUDA_LIST_LENGTH=${#CUDA_LIST[@]}
for i in $(seq 0 $((${CUDA_LIST_LENGTH}-1))); do
    export PORT=${PORTS[i]}
    echo "Starting vllm on port ${PORT}"
    export CUDA_VISIBLE_DEVICES=${CUDA_LIST[i]}
    nohup python -m vllm.entrypoints.openai.api_server --model ${MODEL} --port ${PORT} --trust-remote-code --gpu-memory-utilization 0.9 --max-model-len 8192 &
done
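Both errors look consistent with the eight processes racing to write the same `trust_remote_code` module cache under `~/.cache/huggingface/modules/transformers_modules` (this is an assumption, not confirmed in the issue). One possible mitigation is to serialize the first load with an exclusive file lock so only one process materializes the cached module at a time. `with_file_lock` and the lock path below are hypothetical names for illustration:

```python
# Sketch of a workaround, assuming the failures come from concurrent writes
# to the shared transformers_modules cache. POSIX-only (uses fcntl).
import fcntl


def with_file_lock(lock_path, fn):
    """Run fn() while holding an exclusive advisory lock on lock_path."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return fn()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


# Hypothetical usage around the racy call from the traceback:
# tokenizer = with_file_lock(
#     "/tmp/glm4_module_cache.lock",
#     lambda: AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True),
# )
```

Equivalently, the launch script could wrap only the first `vllm` start with `flock(1)`, or simply warm the cache once with a single process before entering the loop.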

Some of the instances occasionally fail to start. The possible causes collected so far are below.
Error 1

module 'transformers_modules.configuration_chatglm' has no attribute 'ChatGLMConfig'

Error 2

[2024-09-04 17:27:43]   File "OpenRLHF/examples/scripts/../../openrlhf/cli/train_sft.py", line 219, in <module>
[2024-09-04 17:27:43]     train(args)
[2024-09-04 17:27:43]   File "OpenRLHF/examples/scripts/../../openrlhf/cli/train_sft.py", line 34, in train
[2024-09-04 17:27:43]     tokenizer = get_tokenizer(args.pretrain, model.model, "right", strategy, use_fast=not args.disable_fast_tokenizer)
[2024-09-04 17:27:43]   File "OpenRLHF/openrlhf/utils/utils.py", line 16, in get_tokenizer
[2024-09-04 17:27:43]     tokenizer = AutoTokenizer.from_pretrained(pretrain, trust_remote_code=True, use_fast=use_fast)
[2024-09-04 17:27:43]   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 870, in from_pretrained
[2024-09-04 17:27:43]     tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
[2024-09-04 17:27:43]   File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 514, in get_class_from_dynamic_module
[2024-09-04 17:27:43]     return get_class_in_module(class_name, final_module)
[2024-09-04 17:27:43]   File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 212, in get_class_in_module
[2024-09-04 17:27:43]     module_spec.loader.exec_module(module)
[2024-09-04 17:27:43]   File "<frozen importlib._bootstrap_external>", line 879, in exec_module
[2024-09-04 17:27:43]   File "<frozen importlib._bootstrap_external>", line 1017, in get_code
[2024-09-04 17:27:43]   File "<frozen importlib._bootstrap_external>", line 947, in source_to_code
[2024-09-04 17:27:43]   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[2024-09-04 17:27:43]   File "/root/.cache/huggingface/modules/transformers_modules/tokenization_chatglm.py", line 240
[2024-09-04 17:27:43]     """
[2024-09-04 17:27:43]     ^
[2024-09-04 17:27:43] SyntaxError: unterminated triple-quoted string literal (detected at line 254)

This error occurs intermittently, and only when fine-tuning glm; it has not been observed with any other model family so far.
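If the shared module cache is indeed the culprit, another option is to give each process its own dynamic-module cache. `transformers` reads the `HF_MODULES_CACHE` environment variable when it is imported, so setting it per process before the import should isolate the caches entirely; the path scheme below is illustrative only:

```python
# Per-process module cache, assuming the race is on the shared
# transformers_modules directory. Must run before importing transformers.
import os

cache_dir = f"/tmp/hf_modules_{os.getpid()}"  # hypothetical per-process path
os.makedirs(cache_dir, exist_ok=True)
os.environ["HF_MODULES_CACHE"] = cache_dir

# import transformers  # remote-code modules would now be written under cache_dir
```

The same effect can be had from the launch loop by exporting `HF_MODULES_CACHE` to a distinct directory before each `nohup python -m vllm...` line.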

Expected behavior

See above: all launched instances should start successfully.

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this Sep 6, 2024