
Heavy IO in multi-node example #1152

Open · 2 of 4 tasks
rofinn opened this issue Jul 22, 2024 · 1 comment
Labels: bug Something isn't working

Comments

rofinn commented Jul 22, 2024

System Info

Optimum Habana: 1.10.4
Synapse: 1.14.0
Dockerfile:

FROM vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest

# Install pdsh and upgrade pip
RUN apt-get update && apt-get install -y pdsh && \
   python -m pip install --upgrade pip

# Docker ssh port setup
RUN sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config && \
   sed -i 's/#   Port 22/    Port 3022/g' /etc/ssh/ssh_config && \
   sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
   service ssh restart

# Install Optimum Habana and Habana's fork of DeepSpeed
RUN pip install optimum-habana==1.10.4 && \
   pip install git+https://github.com/HabanaAI/[email protected]

# We skip this step in favour of just providing all the SSH information
#CMD ssh-keygen -t rsa -b 4096 -N '' -f ~/.ssh/id_rsa && \
#   chmod 600 ~/.ssh/id_rsa && \
#   cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys && \
#   /bin/bash
COPY .ssh /root/.ssh
COPY optimum-habana/examples/gaudi_spawn.py /root/
COPY optimum-habana/tests/configs/deepspeed_zero_1.json /root/
COPY optimum-habana/tests/configs/deepspeed_zero_2.json /root/
COPY optimum-habana/examples/language-modeling/* /root/
COPY llm-multi/hostfile /root/
RUN pip install -r /root/requirements.txt
RUN pip install --force-reinstall -v "pytest==8.0.0"
WORKDIR /root

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the gpt-neox command provided in the language-modeling example inside a Docker image built from the Dockerfile above.

https://github.com/huggingface/optimum-habana/blob/v1.10.4/examples/language-modeling/README.md#multi-node-training-with-deepspeed-gpt-neox

My commands to reproduce on two nodes with SSH access to one another are:

sudo docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host <image_name>

and then

python gaudi_spawn.py --hostfile hostfile --use_deepspeed run_clm.py \
    --model_name_or_path EleutherAI/gpt-neox-20b \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm-xl-bs2 \
    --overwrite_output_dir \
    --gaudi_config_name habana/gpt2 \
    --use_habana \
    --use_lazy_mode \
    --gradient_checkpointing \
    --use_hpu_graphs_for_inference \
    --throughput_warmup_steps 3 \
    --deepspeed deepspeed_zero_2.json

Sometimes this completes fine. However, it frequently fails with a flood of errors like:

  1. RuntimeError: Connection reset by peer
  2. what(): Broken pipe
  3. Internal Error: Received signal - Aborted
  4. terminate called after throwing an instance of 'std::system_error'
  5. and, finally, [INFO] [launch.py:316:sigkill_handler] Killing subprocess ...

Since this keeps happening while the dataset is being downloaded or chunked, my best guess is that these IO-heavy tasks make the process temporarily unresponsive, and the DistributedRunner isn't able to wait for or re-establish the connection.

My current workaround is to download and chunk the files separately, and then provide that cache to the Docker container on each node before training (see the sketch below). Maybe this could be handled on the DistributedRunner side? Otherwise, the examples should be updated to perform this data preparation outside of the gaudi_spawn call.
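
For concreteness, here is a minimal cache-warming sketch. It assumes the default Hugging Face cache directory is warmed once and then mounted into the container on every node; the exact chunking performed by run_clm.py is not reproduced, and all paths are illustrative.

# Warm the Hugging Face cache outside of gaudi_spawn (assumption: ~/.cache/huggingface
# is later mounted into each container, e.g.
#   docker run -v $HOME/.cache/huggingface:/root/.cache/huggingface ...
# so run_clm.py finds the files locally instead of downloading them mid-launch).
from datasets import load_dataset
from transformers import AutoTokenizer

# Pre-download the dataset used in the example command.
load_dataset("wikitext", "wikitext-2-raw-v1")

# Pre-download the tokenizer (and optionally the model weights) as well.
AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")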

Expected behavior

The gaudi_spawn launch (https://github.com/huggingface/optimum-habana/blob/v1.10.4/examples/language-modeling/README.md#multi-node-training-with-deepspeed-gpt-neox) should not fail with Connection reset and Broken pipe errors while downloading and/or chunking data.

@rofinn rofinn added the bug Something isn't working label Jul 22, 2024
@regisss
Collaborator

regisss commented Oct 22, 2024

@rofinn Still having this issue?
