We conducted distributed fine-tuning experiments on WizardLM using the original Llama-X project. With the same hyperparameters as in the Fine-tuning section, we extend the experiments to multiple nodes. To reproduce them, we provide the steps and system configuration here.
We assume you have three GPU nodes, worker-0, worker-1, and worker-2, to be used for training, and that they can ssh into each other via private key. We assume worker-0 is the master node: it has an open port MASTER_PORT that worker-1 and worker-2 can reach directly, and a MASTER_IP that the other nodes can access.
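For reference, here is a minimal sketch of setting up passwordless SSH between the workers; the key path and the absence of a dedicated user name are assumptions, so adapt them to your cluster:
# run once on worker-0; repeat analogously on the other workers
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa     # skip if a key already exists
ssh-copy-id worker-1                                  # copy the public key to worker-1
ssh-copy-id worker-2
ssh -o BatchMode=yes worker-1 hostname                # should print the remote hostname with no password prompt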
On each worker, configure your environment following the instructions in Llama-X. All workers must use the same absolute paths for the data, output, and code folders, and their configurations should be exactly the same.
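One simple way to keep the paths identical is to prepare everything on worker-0 and mirror it to the other workers; the /path/to placeholders below are the same ones used throughout this guide:
# copy code and data from worker-0 to the other workers at the same absolute paths
rsync -a /path/to/Llama-X/ worker-1:/path/to/Llama-X/
rsync -a /path/to/Llama-X/ worker-2:/path/to/Llama-X/
rsync -a /path/to/alpaca_evol_instruct_70k.json worker-1:/path/to/
rsync -a /path/to/alpaca_evol_instruct_70k.json worker-2:/path/to/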
After that, change the hostfile config (/path/to/Llama-X/src/configs/hostfile) on each node and add every worker to it, assuming 8 GPUs per worker:
worker-0 slots=8
worker-1 slots=8
worker-2 slots=8
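As an optional sanity check before launching, confirm that every host listed in the hostfile is reachable over passwordless SSH:
# each command should print the worker's hostname without asking for a password
for h in worker-0 worker-1 worker-2; do ssh -o BatchMode=yes "$h" hostname; done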
And since NCCL communication problems can occur given the complexity of each cluster, we recommend using this config:
NCCL_DEBUG=INFO
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_BUFFSIZE=2097152
A good practice is to write these variables into the ".deepspeed_env" file in each node's home folder. You can refer to this guide for details: multi-node-environment-variables
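For example, the following appends the settings above to ~/.deepspeed_env (one KEY=VALUE pair per line); run it on every worker:
cat >> ~/.deepspeed_env <<'EOF'
NCCL_DEBUG=INFO
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_BUFFSIZE=2097152
EOF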
Finally, with everything set up, run this command on worker-0. Enjoy your flight!
deepspeed --num_gpus 8 \
--num_nodes 3 \
--master_addr $MASTER_IP \
--master_port $MASTER_PORT \
--hostfile /path/to/Llama-X/src/configs/hostfile \
train_freeform.py \
--model_name_or_path /path/to/llama-7B/hf \
--data_path /path/to/alpaca_evol_instruct_70k.json \
--output_dir /path/to/wizardlm-7B/hf/ft \
--num_train_epochs 3 \
--model_max_length 2048 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 800 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--warmup_steps 2 \
--logging_steps 2 \
--lr_scheduler_type "cosine" \
--report_to "tensorboard" \
--gradient_checkpointing True \
--deepspeed configs/deepspeed_config.json \
--fp16 True
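Since --report_to is set to "tensorboard", you can optionally monitor the training curves from worker-0. The log location below is an assumption; the Hugging Face Trainer typically writes TensorBoard event files under the output directory:
tensorboard --logdir /path/to/wizardlm-7B/hf/ft --port 6006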
Here are some common problems you may see in the console output:
- "Call to ibv_reg_mr failed"
- "ib_plugin.c:670 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129"
If you have InfiniBand (IB) in your system, this problem can be triggered by a misconfigured ulimit. If you run your experiment in a Docker container, this option can be used to lift the ulimit restriction:
docker ... --ulimit memlock=-1
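For illustration, a fuller launch command might look like the following; the image name, mounts, and extra flags are assumptions for a typical multi-GPU setup, and only --ulimit memlock=-1 is the essential part here:
# illustrative only; adjust the image name and mounts to your setup
docker run --gpus all --network host --ipc host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /path/to/Llama-X:/path/to/Llama-X \
  your-training-image:latest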
Alternatively, you can follow the solution in the official NCCL troubleshooting documentation (troubleshooting.html#infiniband) and make sure you log in to each worker as the root user, or use root privileges, to lift the limitation.
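As a sketch of that approach on bare metal (requires root; log out and back in for it to take effect), you can raise the memlock limit system-wide:
echo '* soft memlock unlimited' | sudo tee -a /etc/security/limits.conf
echo '* hard memlock unlimited' | sudo tee -a /etc/security/limits.conf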
Another case is that your workers have no IB and only regular network cards. In that case, you can add these settings to .deepspeed_env to disable IB and communicate over the regular network instead:
NCCL_DEBUG=INFO
NCCL_P2P_DISABLE=1
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=ens9f1
NCCL_SOCKET_IFNAME needs to be changed to your worker's actual network interface name; use ifconfig to find it.
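For example, either of the following lists the interfaces and their IPv4 addresses so you can pick the one bound to your node's private IP (ip is shown as a fallback in case ifconfig is not installed):
ifconfig
ip -o -4 addr show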