exits with return code = -7 #20

gapjialin · 2024-03-20T10:02:19Z

Hello, when I try to fine-tune with LoRA, the following error is reported no matter how far I lower the parameters. I am using 8x A40 GPUs.
(geochat) root@2170f15b1d25:/home/GeoChat/scripts# bash finetune_lora.sh
[2024-03-20 09:49:44,405] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:46,776] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-03-20 09:49:46,776] [INFO] [runner.py:555:main] cmd = /root/miniconda3/envs/geochat/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=21205 --enable_each_rank_log=None /home/GeoChat/geochat/train/train_mem.py --deepspeed /home/GeoChat/scripts/zero2.json --lora_enable True --model_name_or_path /home/LLaVA/llava-v1.5-7b --version v1 --data_path /home/LLaVA-HR/NEWrailwaytrain.json --image_folder /home/LLaVA/data --vision_tower /home/LLaVA/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --pretrain_mm_mlp_adapter /home/LLaVA/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --bf16 True --output_dir /home/GeoChat/checkpoints_dir --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy epoch --save_steps 7000 --save_total_limit 1 --learning_rate 2e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --lazy_preprocess True --dataloader_num_workers 4 --report_to wandb
[2024-03-20 09:49:48,460] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.17.1-1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1
[2024-03-20 09:49:50,890] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-03-20 09:49:50,890] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-03-20 09:49:50,890] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-03-20 09:49:50,890] [INFO] [launch.py:163:main] dist_world_size=8
[2024-03-20 09:49:50,890] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-03-20 09:49:54,389] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,650] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,717] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,717] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,717] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,718] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,741] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,752] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:55,040] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,040] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,284] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,284] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,336] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,336] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,336] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-20 09:49:55,340] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,340] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,349] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,349] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,361] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,361] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,362] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,362] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,411] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,411] [INFO] [comm.py:594:init_distributed] cdb=None
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:24<00:00, 72.26s/it]

Adding LoRA adapters...
Formatting inputs...Skip in lazy mode
[2024-03-20 09:57:14,398] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7453
[2024-03-20 09:57:14,401] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7454
[2024-03-20 09:57:14,401] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7455
[2024-03-20 09:57:15,060] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7456
[2024-03-20 09:57:15,062] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7457
[2024-03-20 09:57:15,064] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7458
[2024-03-20 09:57:15,885] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7459
[2024-03-20 09:57:15,887] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7460
[2024-03-20 09:57:15,888] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda3/envs/geochat/bin/python', '-u', '/home/GeoChat/geochat/train/train_mem.py', '--local_rank=7', '--deepspeed', '/home/GeoChat/scripts/zero2.json', '--lora_enable', 'True', '--model_name_or_path', '/home/LLaVA/llava-v1.5-7b', '--version', 'v1', '--data_path', '/home/LLaVA-HR/NEWrailwaytrain.json', '--image_folder', '/home/LLaVA/data', '--vision_tower', '/home/LLaVA/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--pretrain_mm_mlp_adapter', '/home/LLaVA/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--bf16', 'True', '--output_dir', '/home/GeoChat/checkpoints_dir', '--num_train_epochs', '1', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'epoch', '--save_steps', '7000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--lazy_preprocess', 'True', '--dataloader_num_workers', '4', '--report_to', 'wandb'] exits with return code = -7
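
For reading the final log line: a negative return code from a launched subprocess conventionally means the process was terminated by a signal with that number, and on Linux x86-64 signal 7 is SIGBUS. The sketch below decodes such a code with Python's standard `signal` module. The helper name `describe_return_code` is hypothetical and is not part of DeepSpeed or GeoChat. It only names the signal; it does not identify the cause. SIGBUS in containerized training runs is often linked to exhausted shared memory (`/dev/shm`) when `--dataloader_num_workers` is greater than 0, but that is an assumption the log above does not confirm.

```python
import signal


def describe_return_code(code: int) -> str:
    """Translate a subprocess return code into a readable description.

    Hypothetical helper for illustration; not part of DeepSpeed or GeoChat.
    """
    if code == 0:
        return "exited normally"
    if code > 0:
        return f"exited with status {code}"
    # Negative codes follow the subprocess convention:
    # -N means the process was terminated by signal N.
    signum = -code
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The number does not map to a named signal on this platform.
        name = "unknown signal"
    return f"terminated by signal {signum} ({name})"


if __name__ == "__main__":
    # The launcher log above reports "exits with return code = -7".
    print(describe_return_code(-7))
```

Signal numbers are platform-specific, so the name printed for 7 depends on the operating system the script runs on.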
