
Multi-node Training #9

Open
chtmp223 opened this issue Nov 26, 2024 · 1 comment

Comments

@chtmp223

Hello,

Thank you for the great work! I've been fine-tuning ProLong on my own dataset and it works really well.

I am curious whether you have trained with multiple nodes before. I saw that train_sft.sh can handle more than one node (lines 67-77). However, when I modified the code to work on my custom dataset, I always got torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError, even with the rdzv-conf timeout set to 600.

Here is my code in case it helps. I want to fine-tune on 8 A100s across 2 nodes:

#!/bin/bash -l
#SBATCH --gpus-per-node=4
#SBATCH --nodes=2
#SBATCH --mem=600GB
#SBATCH --constraint=a100-80g
#SBATCH -c 16

# Setup ---
model=${MODEL:-"/models/Llama-3-8B-ProLong-512k-Base/"}
model_alias=${MODEL_ALIAS:-prolong-512K-base}
bsz=${BSZ:-16}
seq=${SEQ:-1}
lr=${LR:-1e-6}
steps=${STEPS:-125}
save_steps=${SAVE:-50}
warmup=${WARMUP:-0.05}
dataset=${DATASET:-"<data_path>"}
dataset_size=${DATASET_SIZE:-2000}

# Detect number of GPUs per node --- 
if [ -z "$CUDA_VISIBLE_DEVICES" ]; then
    num_gpus_per_node=$(nvidia-smi -L | wc -l)
else
    num_gpus_per_node=$(jq -n "[$CUDA_VISIBLE_DEVICES] | length")
fi
num_gpus_per_node=${NUM_GPUS_PER_NODE:-$num_gpus_per_node}

# Detect number of nodes --- 
num_nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l)
num_nodes=${NUM_NODES:-$num_nodes}
total_gpus=$(($num_gpus_per_node * $num_nodes))
accu=$(($bsz / $seq / $total_gpus))

nvidia-smi

# Launch torchrun based on the number of nodes ---
if [ $num_nodes -gt 1 ]; then
    master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
    master_addr=${MASTER_ADDR:-$master_addr}

    header="torchrun \              # omitted srun because it throws invalid slurm specification error
    --rdzv-backend=c10d \
    --rdzv-endpoint=$master_addr:56321 \
    --nnodes=$num_nodes \
    --nproc-per-node=$num_gpus_per_node \
    --rdzv-conf timeout=600 \
    -m finetune.py"
else
    master_port=$(comm -23 <(seq 49152 65535 | sort) <(ss -Htan | awk '{print $4}' | cut -d':' -f2 | sort -u) | shuf | head -n 1)

    header="torchrun \
    --standalone \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:$master_port \
    --nnodes=1 \
    --nproc-per-node=$num_gpus_per_node \
    finetune.py"
fi

# Print the detected distributed setup ---
echo "slurm_nodelist=${SLURM_NODELIST} num_nodes=${num_nodes} num_gpus_per_node=${num_gpus_per_node} total_gpus=${total_gpus} master_addr=${master_addr}"

run_name="multinode_${model_alias}_$(basename $dataset)_bsz-${bsz}_lr-${lr}_steps-${steps}"

out_dir="<out_dir>"
mkdir -p $out_dir

fsdp=${FSDP:-"1"}
gc=${GC:-"1"}
export OMP_NUM_THREADS=$num_gpus_per_node
export WANDB_PROJECT="prolong"
export WANDB_DIR=$out_dir
export WANDB_MODE="online"
export TOKENIZERS_PARALLELISM=true
export FSDP_SHARDING_STRATEGY="1"
export FSDP_STATE_DICT_TYPE="FULL_STATE_DICT"

base_arguments=(
    --apply_instruct_masks
    --token_scaled_loss
    --seq_parallel_size $num_gpus_per_node
    
    --report_to wandb
    --do_train

    --model_name_or_path $model
    --config_name $model
    --tokenizer_name $model

    --run_name $run_name
    --output_dir $out_dir
    --config_overrides_json "$overrides"
    --gradient_accumulation_steps $accu
    --per_device_train_batch_size $seq
    --per_device_eval_batch_size $seq

    --bf16
    --learning_rate $lr
    --min_lr_ratio 0.1
    --lr_scheduler_type cosine
    --max_grad_norm 1.0
    --adam_beta1 0.9
    --adam_beta2 0.95
    --weight_decay 0.1
    --warmup_ratio $warmup
    --optim adamw_torch

    --logging_steps 1
    --log_level info

    --max_steps $steps
    --save_steps $save_steps
    --dataloader_num_workers 1

    --disable_tqdm true
    --use_fast_tokenizer false
    --remove_unused_columns false
    --ddp_find_unused_parameters false

    --fsdp "auto_wrap"
    --gradient_checkpointing

    --tokenized_mds_train $dataset

    --cuda_empty_cache
)

echo command: "${header} ${base_arguments[@]}"
${header} "${base_arguments[@]}" 2>&1 | tee -a $out_dir/log.out

Curious to hear your insights!

@CodeCreator
Member

If you use Slurm, then srun is important because it makes sure the job is actually launched on all nodes. Without srun ever starting torchrun on the other nodes, the head node gets the rendezvous timeout error. So try to understand why you get the Slurm specification error with srun!
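
For reference, a minimal sketch of how the torchrun invocation could be wrapped in srun, reusing the variables computed in the script above. This is not the repo's official launcher; the --ntasks-per-node=1 setting, the rendezvous id, and port 56321 are illustrative assumptions, and finetune.py stands in for the real entry point:

# Launch exactly one torchrun per node; srun starts the command on every allocated node,
# and the c10d rendezvous on the head node ties the per-node launches together.
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun --ntasks-per-node=1 torchrun \
    --nnodes=$num_nodes \
    --nproc-per-node=$num_gpus_per_node \
    --rdzv-backend=c10d \
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-endpoint=$master_addr:56321 \
    finetune.py "${base_arguments[@]}" 2>&1 | tee -a $out_dir/log.out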
