Nccl Library Unavailable on windows #33

Open
cdilga opened this issue Nov 5, 2019 · 0 comments

cdilga commented Nov 5, 2019

Training StyleGAN on multiple GPUs requires NCCL, which is not available on Windows.
StyleGAN uses its own scheme for reducing the gradients across devices and applying the updates, and that scheme does not resemble the APIs exposed by TensorFlow.

This causes an error like:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node TrainD/SumAcrossGPUs/NcclAllReduce (defined at D:\data\oliver-train-checkface\fflowhq\00005-sgan-flower-1gpu\src\dnnlib\tflib\optimizer.py:135) with these attrs: [reduction="sum", shared_name="c124", T=DT_FLOAT, num_devices=2]

No drop-in replacement has been found so far. TensorFlow's generic all-reduce implementations, such as the HierarchicalAllReduce used with Keras (see tensorflow/tensorflow#21470), do not share an API with the nccl_ops.py interface:
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/nccl_ops.py

More surprisingly, other ops, such as those in collective_ops.py
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/collective_ops.py
are not drop-in replacements either. They appear to target quite different use cases, as their tests make clear:
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/nccl_ops_test.py
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/collective_ops_test.py

The line that needs to be updated or removed seems to be the following:

g = nccl_ops.all_sum(g)

This is the point at which the gradients from all devices are summed before each device is updated. Higher-level APIs such as HierarchicalAllReduce would handle this entire process, including the update on each device, which makes them poorly suited to this use case.
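
To make clear what any replacement for that line would have to preserve, here is a minimal pure-Python sketch of the all-sum contract: one gradient per device goes in, and every device gets back the same element-wise sum. The name `all_sum_reference` and the plain-list representation are assumptions for illustration only. This is not the StyleGAN or TensorFlow implementation, and it does not touch NCCL.

```python
from typing import List, Sequence


def all_sum_reference(per_device: Sequence[Sequence[float]]) -> List[List[float]]:
    """Reference semantics of an all-sum, with no TensorFlow or NCCL involved.

    per_device[i] is the flattened gradient held by (hypothetical) device i.
    The result has one entry per device, and every entry is the element-wise
    sum over all devices.
    """
    if not per_device:
        return []
    length = len(per_device[0])
    if any(len(g) != length for g in per_device):
        raise ValueError("all devices must hold gradients of the same shape")
    # Sum position by position across devices.
    total = [sum(column) for column in zip(*per_device)]
    # Each device receives its own copy of the summed gradient.
    return [list(total) for _ in per_device]


grads = [[1.0, 2.0], [10.0, 20.0]]  # two hypothetical devices
summed = all_sum_reference(grads)
print(summed)  # [[11.0, 22.0], [11.0, 22.0]]
```

In a TensorFlow 1.x graph, one untested possibility for getting the same result without NCCL would be to sum the per-device gradients with `tf.add_n` and hand the result back to each device, presumably at the cost of extra copies between devices.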

@olivercoad

@cdilga cdilga self-assigned this Nov 5, 2019
@cdilga cdilga added the bug Something isn't working label Nov 5, 2019