System Info
Optimum Habana: 1.10.4
Synapse: 1.14.0
Dockerfile:
Reproduction
Run the gpt-neox command provided in the language-modeling example inside the Docker image created from the above snippet.
https://github.com/huggingface/optimum-habana/blob/v1.10.4/examples/language-modeling/README.md#multi-node-training-with-deepspeed-gpt-neox
My commands to reproduce on two nodes with ssh access to one another are:
and then
Sometimes this completes fine. However, it frequently fails with a series of errors like:
RuntimeError: Connection reset by peer
what(): Broken pipe
Internal Error: Received signal - Aborted
terminate called after throwing an instance of std::system_error
and finally
[INFO] [launch.py:316:sigfill_handler] Killing subprocess ...
Since this keeps happening while downloading files or chunking, my best guess is that those IO-heavy tasks are making the process temporarily unresponsive. I guess the DistributedRunner isn't able to wait for or reestablish the connection?
My current workaround is to cache the downloaded and chunked files separately, and then provide that cache to the Docker container for training on each node. Maybe this could be handled on the DistributedRunner side? Otherwise, the examples should be updated to perform this data preparation outside of the gaudi_spawn call.
Expected behavior
The gaudi_spawn command (https://github.com/huggingface/optimum-habana/blob/v1.10.4/examples/language-modeling/README.md#multi-node-training-with-deepspeed-gpt-neox) should not fail with Connection reset and Broken pipe errors while downloading and/or chunking data.
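For reference, the workaround described above (preparing the data before the gaudi_spawn call) could look roughly like the sketch below. This is not the script I actually ran, and several things in it are assumptions: the function name prepare_cache and the cache_dir argument are hypothetical, the dataset (wikitext, wikitext-2-raw-v1) and model name (EleutherAI/gpt-neox-20b) are assumed to match the README command, and group_texts is a simplified reimplementation of the chunking step rather than the code in run_clm.py. Whether the training script then reuses the cached results depends on its own cache settings.

```python
from itertools import chain


def group_texts(examples, block_size):
    """Concatenate tokenized sequences and split them into fixed-size blocks.

    Simplified stand-in for the chunking step; the trailing remainder that
    does not fill a whole block is dropped.
    """
    concatenated = {k: list(chain.from_iterable(examples[k])) for k in examples.keys()}
    total_length = len(concatenated[next(iter(concatenated))])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [tokens[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


def prepare_cache(cache_dir, block_size=1024):
    """Hypothetical helper: download, tokenize and chunk once, before training.

    Assumes the `datasets` and `transformers` packages are installed; the
    dataset and model names are assumptions, not taken from this issue.
    """
    # Imported lazily so the chunking helper above works without these packages.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    raw = load_dataset("wikitext", "wikitext-2-raw-v1", cache_dir=cache_dir)
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b", cache_dir=cache_dir)
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"]),
        batched=True,
        remove_columns=["text"],
    )
    return tokenized.map(lambda batch: group_texts(batch, block_size), batched=True)
```

The idea would be to run prepare_cache once per node (or once on shared storage) in a single process, then mount that directory into the Docker container so that no download or chunking happens while the distributed processes are waiting on each other.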