This repository has been archived by the owner on Mar 6, 2024. It is now read-only.

Can we use Bitfusion to run Distributed Data Parallel PyTorch code? #43

Open

ljz756245026 opened this issue Sep 27, 2021 · 2 comments

ljz756245026 commented Sep 27, 2021

Recently, I got a VM with 2 A100 GPUs. I want to use this VM to run data-parallel training through PyTorch. However, I ran into several environment problems, even though the same code succeeded on my lab's server without Bitfusion. I want to know whether Bitfusion does not support torch.nn.parallel.DistributedDataParallel (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) or NCCL (https://developer.nvidia.com/nccl).

I am looking forward to your reply.
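For context, this is a minimal sketch of the kind of multi-GPU DDP script in question, assuming a single VM with two GPUs; the toy model, port number, and spawn setup are illustrative placeholders, not taken from the original report:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # Rendezvous settings for a single-machine run; the port is an
    # arbitrary free port chosen for this sketch.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"

    # Multi-GPU DDP normally uses the NCCL backend, which is where the
    # Bitfusion compatibility question arises.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy model, one replica per GPU.
    model = DDP(nn.Linear(10, 10).to(rank), device_ids=[rank])
    loss = model(torch.randn(20, 10, device=rank)).sum()
    loss.backward()  # gradients are all-reduced over NCCL here

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # one process per A100
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```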


YanJenHuang commented Oct 25, 2022

Have you solved this issue? I ran into the same problem but couldn't find any resources or solutions.

ljz756245026 (Author) commented Oct 25, 2022

No. Bitfusion does not support DDP, because some NCCL versions are not supported by Bitfusion, and we cannot change the NCCL version.
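For anyone hitting this: since the reported incompatibility is tied to the NCCL version, and PyTorch binaries ship with NCCL bundled in (so it cannot simply be swapped out), a quick way to check which NCCL version a given PyTorch build uses is:

```python
import torch

print(torch.__version__)
# Reports the NCCL version bundled with this PyTorch build,
# e.g. (2, 10, 3) on recent releases.
print(torch.cuda.nccl.version())
```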
