Training StyleGAN on multiple GPUs requires NCCL, which is not available on Windows.
StyleGAN uses a custom scheme for reducing and updating the gradients across devices, and it does not map onto the APIs that TensorFlow exposes elsewhere.
This causes an error like:

```
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node TrainD/SumAcrossGPUs/NcclAllReduce (defined at D:\data\oliver-train-checkface\fflowhq\00005-sgan-flower-1gpu\src\dnnlib\tflib\optimizer.py:135) with these attrs: [reduction="sum", shared_name="c124", T=DT_FLOAT, num_devices=2]
```
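For context, `NcclAllReduce` with `reduction="sum"` performs an all-reduce: after the op runs, every device holds the elementwise sum of all devices' tensors. A minimal NumPy sketch of those semantics (the arrays are stand-ins for the per-GPU gradient tensors; this is an illustration, not the actual op):

```python
import numpy as np

def all_reduce_sum(device_grads):
    """Mimic NcclAllReduce with reduction="sum": every device ends up
    holding the elementwise sum of all devices' tensors."""
    total = np.sum(device_grads, axis=0)
    return [total.copy() for _ in device_grads]

# Stand-in gradients for num_devices=2, matching the attrs in the error.
grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
reduced = all_reduce_sum(grads)
# every "device" now holds [4.0, 6.0]
```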
No drop-in replacement has been found, because the API for generic TensorFlow operations like `HierarchicalAllReduce` (used by Keras, e.g. in tensorflow/tensorflow#21470) is not compatible with the `nccl_ops.py` interface (https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/nccl_ops.py). Perhaps even more surprising is that other ops, such as those in `collective_ops.py` (https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/collective_ops.py), do not provide drop-in replacements either. These ops seem to have completely different use cases, as their tests make clear:

https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/nccl_ops_test.py
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/collective_ops_test.py
The line that needs to be updated or removed seems to be the following:

`checkface/src/server/dnnlib/tflib/optimizer.py`, line 135 (at commit a88dab0)
This is the point at which all of the device gradients are summed together before each device is updated. Higher-level APIs like `HierarchicalAllReduce` would handle this entire process, including updating each of the devices, but they are not well suited to this use case.

cc @olivercoad
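One NCCL-free alternative (a sketch of a possible workaround, not the repository's actual fix) is to replace the all-reduce with a reduce-then-broadcast: copy every gradient to a single device, sum them there (e.g. with `tf.add_n` pinned under one `tf.device` scope), and have each device read the total back. The semantics, again with NumPy stand-ins for the per-GPU tensors:

```python
import numpy as np

def reduce_then_broadcast(device_grads):
    """NCCL-free equivalent of an all-reduce sum: sum all gradients on
    one device, then copy the total back to every device. Numerically
    identical to NcclAllReduce(sum), but the traffic funnels through a
    single device instead of using NCCL's peer-to-peer transfers."""
    total = device_grads[0].copy()
    for g in device_grads[1:]:
        total = total + g  # tf.add_n would do this in the TF version
    return [total.copy() for _ in device_grads]

grads = [np.array([0.5, 1.5]), np.array([2.5, 3.5])]
out = reduce_then_broadcast(grads)
# every "device" now holds [3.0, 5.0]
```

The trade-off is bandwidth: all gradients flow through one device, which is slower than NCCL's ring all-reduce on large models but keeps the graph portable to platforms without NCCL.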