exits with return code = -7 #20

Open
gapjialin opened this issue Mar 20, 2024 · 0 comments

Hello, when I try to fine-tune with LoRA, the following error is reported no matter how far I lower the parameters. I am using 8x A40 GPUs.
(geochat) root@2170f15b1d25:/home/GeoChat/scripts# bash finetune_lora.sh
[2024-03-20 09:49:44,405] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:46,776] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-03-20 09:49:46,776] [INFO] [runner.py:555:main] cmd = /root/miniconda3/envs/geochat/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=21205 --enable_each_rank_log=None /home/GeoChat/geochat/train/train_mem.py --deepspeed /home/GeoChat/scripts/zero2.json --lora_enable True --model_name_or_path /home/LLaVA/llava-v1.5-7b --version v1 --data_path /home/LLaVA-HR/NEWrailwaytrain.json --image_folder /home/LLaVA/data --vision_tower /home/LLaVA/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --pretrain_mm_mlp_adapter /home/LLaVA/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --bf16 True --output_dir /home/GeoChat/checkpoints_dir --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy epoch --save_steps 7000 --save_total_limit 1 --learning_rate 2e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --lazy_preprocess True --dataloader_num_workers 4 --report_to wandb
[2024-03-20 09:49:48,460] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.17.1-1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1
[2024-03-20 09:49:50,890] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-03-20 09:49:50,890] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-03-20 09:49:50,890] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-03-20 09:49:50,890] [INFO] [launch.py:163:main] dist_world_size=8
[2024-03-20 09:49:50,890] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-03-20 09:49:54,389] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,650] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,717] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,717] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,717] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,718] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,741] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,752] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:55,040] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,040] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,284] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,284] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,336] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,336] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,336] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-20 09:49:55,340] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,340] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,349] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,349] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,361] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,361] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,362] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,362] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,411] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,411] [INFO] [comm.py:594:init_distributed] cdb=None
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:24<00:00, 72.26s/it]

Adding LoRA adapters...
Formatting inputs...Skip in lazy mode
[2024-03-20 09:57:14,398] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7453
[2024-03-20 09:57:14,401] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7454
[2024-03-20 09:57:14,401] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7455
[2024-03-20 09:57:15,060] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7456
[2024-03-20 09:57:15,062] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7457
[2024-03-20 09:57:15,064] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7458
[2024-03-20 09:57:15,885] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7459
[2024-03-20 09:57:15,887] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7460
[2024-03-20 09:57:15,888] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda3/envs/geochat/bin/python', '-u', '/home/GeoChat/geochat/train/train_mem.py', '--local_rank=7', '--deepspeed', '/home/GeoChat/scripts/zero2.json', '--lora_enable', 'True', '--model_name_or_path', '/home/LLaVA/llava-v1.5-7b', '--version', 'v1', '--data_path', '/home/LLaVA-HR/NEWrailwaytrain.json', '--image_folder', '/home/LLaVA/data', '--vision_tower', '/home/LLaVA/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--pretrain_mm_mlp_adapter', '/home/LLaVA/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--bf16', 'True', '--output_dir', '/home/GeoChat/checkpoints_dir', '--num_train_epochs', '1', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'epoch', '--save_steps', '7000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--lazy_preprocess', 'True', '--dataloader_num_workers', '4', '--report_to', 'wandb'] exits with return code = -7
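
For reference, a negative return code from a Python-launched subprocess normally means the child was terminated by the signal with that number, so `-7` most likely corresponds to signal 7. Below is a minimal sketch for decoding such a code, assuming the launcher reports the value the way Python's `subprocess` module does; the helper name `describe_return_code` is hypothetical and not part of DeepSpeed.

```python
import signal


def describe_return_code(code: int) -> str:
    """Hypothetical helper: explain a subprocess return code.

    Assumes the subprocess convention that a negative value -N
    means the child was terminated by signal N.
    """
    if code < 0:
        try:
            # Signal numbering is platform-specific; this looks up the
            # name on the machine where the script runs.
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-code} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    # The return code reported in the log above.
    print(describe_return_code(-7))
```

On Linux x86-64, signal 7 is `SIGBUS` (bus error). The log above does not show what raised the signal, so this only narrows down how the processes died, not why.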
