
Heavy IO in multi-node example #1152

Open · 2 of 4 tasks
rofinn opened this issue Jul 22, 2024 · 1 comment
Labels: bug Something isn't working

Comments

rofinn commented Jul 22, 2024

System Info

Optimum Habana: 1.10.4
Synapse: 1.14.0
Dockerfile:

FROM vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest

# Install pdsh and upgrade pip
RUN apt-get update && apt-get install -y pdsh && \
   python -m pip install --upgrade pip

# Docker ssh port setup
RUN sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config && \
   sed -i 's/#   Port 22/    Port 3022/g' /etc/ssh/ssh_config && \
   sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
   service ssh restart

# Install Optimum Habana and Habana's fork of DeepSpeed
RUN pip install optimum-habana==1.10.4 && \
   pip install git+https://github.com/HabanaAI/[email protected]

# We skip this step in favour of just providing all the SSH information
#CMD ssh-keygen -t rsa -b 4096 -N '' -f ~/.ssh/id_rsa && \
#   chmod 600 ~/.ssh/id_rsa && \
#   cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys && \
#   /bin/bash
COPY .ssh /root/.ssh
COPY optimum-habana/examples/gaudi_spawn.py /root/
COPY optimum-habana/tests/configs/deepspeed_zero_1.json /root/
COPY optimum-habana/tests/configs/deepspeed_zero_2.json /root/
COPY optimum-habana/examples/language-modeling/* /root/
COPY llm-multi/hostfile /root/
RUN pip install -r /root/requirements.txt
RUN pip install --force-reinstall -v "pytest==8.0.0"
WORKDIR /root

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the gpt-neox command provided in the language-modeling example inside a Docker image built from the Dockerfile above.

https://github.com/huggingface/optimum-habana/blob/v1.10.4/examples/language-modeling/README.md#multi-node-training-with-deepspeed-gpt-neox

My commands to reproduce on two nodes with SSH access to one another are:

sudo docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host <image_name>

and then

python gaudi_spawn.py --hostfile hostfile --use_deepspeed run_clm.py \
    --model_name_or_path EleutherAI/gpt-neox-20b \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm-xl-bs2 \
    --overwrite_output_dir \
    --gaudi_config_name habana/gpt2 \
    --use_habana \
    --use_lazy_mode \
    --gradient_checkpointing \
    --use_hpu_graphs_for_inference \
    --throughput_warmup_steps 3 \
    --deepspeed deepspeed_zero_2.json

Sometimes this completes fine. However, it frequently fails with a flood of errors like:

  1. RuntimeError: Connection reset by peer
  2. what(): Broken pipe
  3. Internal Error: Received signal - Aborted
  4. terminate called after throwing an instance of 'std::system_error'
  5. and, finally, [INFO] [launch.py:316:sigkill_handler] Killing subprocess ...

Since this keeps happening while the dataset is being downloaded or chunked, my best guess is that these IO-heavy tasks make the process temporarily unresponsive, and the DistributedRunner isn't able to wait for or re-establish the connection.

My current workaround is to download and chunk the files separately, and then provide that cache to the Docker container on each node before training (see the sketch below). Maybe this could be handled on the DistributedRunner side? Otherwise, the examples should be updated to perform this data preparation outside of the gaudi_spawn call.
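
For concreteness, here is a minimal cache-warming sketch. It assumes the default Hugging Face cache directory is warmed once and then mounted into the container on every node; the exact chunking performed by run_clm.py is not reproduced, and all paths are illustrative.

# Warm the Hugging Face cache outside of gaudi_spawn (assumption: ~/.cache/huggingface
# is later mounted into each container, e.g.
#   docker run -v $HOME/.cache/huggingface:/root/.cache/huggingface ...
# so run_clm.py finds the files locally instead of downloading them mid-launch).
from datasets import load_dataset
from transformers import AutoTokenizer

# Pre-download the dataset used in the example command.
load_dataset("wikitext", "wikitext-2-raw-v1")

# Pre-download the tokenizer (and optionally the model weights) as well.
AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")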

Expected behavior

The gaudi_spawn launch (https://github.com/huggingface/optimum-habana/blob/v1.10.4/examples/language-modeling/README.md#multi-node-training-with-deepspeed-gpt-neox) should not fail with Connection reset and Broken pipe errors while downloading and/or chunking data.

@rofinn rofinn added the bug Something isn't working label Jul 22, 2024
@regisss
Collaborator

regisss commented Oct 22, 2024

@rofinn Still having this issue?
