
Multi-node Training #9

Open
chtmp223 opened this issue Nov 26, 2024 · 1 comment

Comments

@chtmp223

Hello,

Thank you for the great work! I've been fine-tuning ProLong on my own dataset and it works really well.

I am curious whether you have trained with multiple nodes before. I saw that train_sft.sh can handle more than one node (lines 67-77). However, when I modified the code to work on my custom dataset, I always got torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError, even with the rdzv-conf timeout set to 600.

Here is my code in case it helps. I want to fine-tune on 8 A100s across 2 nodes:

#!/bin/bash -l
#SBATCH --gpus-per-node=4
#SBATCH --nodes=2
#SBATCH --mem=600GB
#SBATCH --constraint=a100-80g
#SBATCH -c 16

# Setup ---
model=${MODEL:-"/models/Llama-3-8B-ProLong-512k-Base/"}
model_alias=${MODEL_ALIAS:-prolong-512K-base}
bsz=${BSZ:-16}
seq=${SEQ:-1}
lr=${LR:-1e-6}
steps=${STEPS:-125}
save_steps=${SAVE:-50}
warmup=${WARMUP:-0.05}
dataset=${DATASET:-"<data_path>"}
dataset_size=${DATASET_SIZE:-2000}

# Detect number of GPUs per node --- 
if [ -z "$CUDA_VISIBLE_DEVICES" ]; then
    num_gpus_per_node=$(nvidia-smi -L | wc -l)
else
    num_gpus_per_node=$(jq -n "[$CUDA_VISIBLE_DEVICES] | length")
fi
num_gpus_per_node=${NUM_GPUS_PER_NODE:-$num_gpus_per_node}

# Detect number of nodes --- 
num_nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l)
num_nodes=${NUM_NODES:-$num_nodes}
total_gpus=$(($num_gpus_per_node * $num_nodes))
accu=$(($bsz / $seq / $total_gpus))

nvidia-smi

# Launch torchrun based on the number of nodes ---
if [ $num_nodes -gt 1 ]; then
    master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
    master_addr=${MASTER_ADDR:-$master_addr}

    header="torchrun \              # omitted srun because it throws invalid slurm specification error
    --rdzv-backend=c10d \
    --rdzv-endpoint=$master_addr:56321 \
    --nnodes=$num_nodes \
    --nproc-per-node=$num_gpus_per_node \
    --rdzv-conf timeout=600 \
    -m finetune.py"
else
    master_port=$(comm -23 <(seq 49152 65535 | sort) <(ss -Htan | awk '{print $4}' | cut -d':' -f2 | sort -u) | shuf | head -n 1)

    header="torchrun \
    --standalone \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:$master_port \
    --nnodes=1 \
    --nproc-per-node=$num_gpus_per_node \
    finetune.py"
fi

# Print the detected distributed setup ---
echo "slurm_nodelist=${SLURM_NODELIST} num_nodes=${num_nodes} num_gpus_per_node=${num_gpus_per_node} total_gpus=${total_gpus} master_addr=${master_addr}"

run_name="multinode_${model_alias}_$(basename $dataset)_bsz-${bsz}_lr-${lr}_steps-${steps}"

out_dir="<out_dir>"
mkdir -p $out_dir

fsdp=${FSDP:-"1"}
gc=${GC:-"1"}
export OMP_NUM_THREADS=$num_gpus_per_node
export WANDB_PROJECT="prolong"
export WANDB_DIR=$out_dir
export WANDB_MODE="online"
export TOKENIZERS_PARALLELISM=true
export FSDP_SHARDING_STRATEGY="1"
export FSDP_STATE_DICT_TYPE="FULL_STATE_DICT"

base_arguments=(
    --apply_instruct_masks
    --token_scaled_loss
    --seq_parallel_size $num_gpus_per_node
    
    --report_to wandb
    --do_train

    --model_name_or_path $model
    --config_name $model
    --tokenizer_name $model

    --run_name $run_name
    --output_dir $out_dir
    --config_overrides_json "$overrides"
    --gradient_accumulation_steps $accu
    --per_device_train_batch_size $seq
    --per_device_eval_batch_size $seq

    --bf16
    --learning_rate $lr
    --min_lr_ratio 0.1
    --lr_scheduler_type cosine
    --max_grad_norm 1.0
    --adam_beta1 0.9
    --adam_beta2 0.95
    --weight_decay 0.1
    --warmup_ratio $warmup
    --optim adamw_torch

    --logging_steps 1
    --log_level info

    --max_steps $steps
    --save_steps $save_steps
    --dataloader_num_workers 1

    --disable_tqdm true
    --use_fast_tokenizer false
    --remove_unused_columns false
    --ddp_find_unused_parameters false

    --fsdp "auto_wrap"
    --gradient_checkpointing

    --tokenized_mds_train $dataset

    --cuda_empty_cache
)

echo command: "${header} ${base_arguments[@]}"
${header} "${base_arguments[@]}" 2>&1 | tee -a $out_dir/log.out

Curious to hear your insights!

@CodeCreator
Member

If you use Slurm, then srun is important because it makes sure the job is actually launched on all nodes. Without srun ever starting torchrun on the other nodes, the head node gets the rendezvous timeout error. So try to understand why you get the Slurm specification error with srun!
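
For reference, a minimal sketch of how the torchrun invocation could be wrapped in srun, reusing the variables computed in the script above. This is not the repo's official launcher; the --ntasks-per-node=1 setting, the rendezvous id, and port 56321 are illustrative assumptions, and finetune.py stands in for the real entry point:

# Launch exactly one torchrun per node; srun starts the command on every allocated node,
# and the c10d rendezvous on the head node ties the per-node launches together.
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun --ntasks-per-node=1 torchrun \
    --nnodes=$num_nodes \
    --nproc-per-node=$num_gpus_per_node \
    --rdzv-backend=c10d \
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-endpoint=$master_addr:56321 \
    finetune.py "${base_arguments[@]}" 2>&1 | tee -a $out_dir/log.out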
