CrossFire & MultiGPU & GPU workload #90
Great question. I don't believe we've actually tried a CrossFire setup yet. I recall @adtsai tested SqueezeNet with two non-CrossFired GPUs. If I remember correctly, alternate training steps were executed on the two devices; however, this required an "enlightened" model deployment (see model_deploy.py). Adrian may recall more about this experiment.

There are some distributed execution strategies in TF that automatically assign work in an intelligent way. However, we haven't implemented this functionality yet, and it may be a while before we do. Our initial focus is on single-GPU performance, since that is what the majority of users will have. For now, my guess is that you'll get the best performance by treating the two cards as separate devices and having the model explicitly assign big chunks of work to each. Long term, we'd like to make this more automatic.

We can ask AMD directly about question 2b. We currently use a "direct" command queue instead of a compute-specific command queue, so setting "GPU Workload" to "Compute" in AMD settings may not make any difference. I'll get back to you on this.
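As a rough illustration of "treating the two cards as separate devices", here is a minimal sketch (not from the SqueezeNet sample) that pins one large piece of work to each adapter. It assumes the `/device:DML:0` and `/device:DML:1` device names mentioned below and the TF 1.15 graph-mode API that tensorflow-directml is built on:

```python
import tensorflow as tf  # tensorflow-directml is based on TF 1.15 (graph mode)

# Two large, independent chunks of work.
a = tf.random.uniform([4096, 4096])
b = tf.random.uniform([4096, 4096])

# Explicitly place one chunk on each DirectML adapter.
with tf.device('/device:DML:0'):
    c0 = tf.matmul(a, b)
with tf.device('/device:DML:1'):
    c1 = tf.matmul(b, a)

total = c0 + c1  # TF inserts the cross-device copies automatically

with tf.compat.v1.Session() as sess:
    sess.run(total)
```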
@YuriyTigiev: If the CrossFire option in the AMD control panel does what I think it does, I would recommend disabling it. TensorFlow actually expects the GPUs to appear as separate devices - distribution of work across the GPUs is handled by TF itself rather than the driver. Similarly, you should disable SLI if you happen to have NVIDIA GPUs. As @jstoecker mentioned, though, our support for distributed execution isn't 100% complete yet. That being said, if you un-crossfire your GPUs, both of them should be visible to TensorFlow (they should show up as e.g. /device:DML:0 and /device:DML:1). Our SqueezeNet sample does support multi-GPU training; it's just not enabled by default. If you'd like to test out multi-GPU, try the following: [...]
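As a side note (this is a hedged sketch, not the sample's own multi-GPU steps, which aren't listed here), one way to confirm that both adapters are exposed to TensorFlow after disabling CrossFire is to enumerate the local devices:

```python
# List the devices tensorflow-directml can see (TF 1.15 API).
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    print(d.name, d.physical_device_desc)

# With CrossFire/SLI disabled, you would expect /device:DML:0 and /device:DML:1
# to appear alongside /device:CPU:0.
```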
I'm sharing the results of testing SqueezeNet.

With CrossFire:

Without CrossFire:
It looks like you're only running [...]. There's also a script in the squeezenet sample, [...]. We also have the yolov3 sample if SqueezeNet isn't cutting it for your needs. AI Benchmark is another set of scripts you can try out for performance, and we're using it to help guide some of our upcoming perf work.
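If you want to give AI Benchmark a try, a minimal invocation looks roughly like the sketch below. This assumes the `ai-benchmark` PyPI package (not named explicitly above) installed into the same environment as tensorflow-directml:

```python
# Hedged sketch: run AI Benchmark against the installed TensorFlow build.
# Assumes: pip install ai-benchmark
from ai_benchmark import AIBenchmark

benchmark = AIBenchmark()
results = benchmark.run()  # runs the inference + training suite and prints device scores
```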
Result of SqueezeNet testing. Attached: crossfire_disabled_cifar_None32_NCHW_201118-230041.zip

CrossFire Disabled:

CrossFire Enabled:
Hi Yuriy, keep in mind that the results may not be an apples-to-apples comparison because of the way the model divides the batches between the GPUs. These scripts pick a default batch_size of 32, and this represents the minibatch size for each GPU. That is, if running on one GPU you're processing 32 samples per step; if running on two GPUs, you're processing 64 total samples per step (32 per GPU). (And remember that if you have CrossFire enabled, TensorFlow only sees the two cards as one GPU, not two.)

This means that, for a fixed number of training steps, you won't necessarily see a lower wall-clock time with more GPUs -- but you likely will see the model converge faster, because you're processing more total samples when you add more GPUs. One thing you could try to offset this is to set e.g. [...].

The other thing to keep in mind is that SqueezeNet is, naturally, quite a small model that only takes a few minutes to train. There's a certain amount of fixed synchronization overhead when using multiple GPUs, and for a small model like SqueezeNet (where a training step only takes a handful of milliseconds) this overhead can take up a large proportion of the total running time. With a larger model (or a much larger batch_size - say, 256 or 512) you should start to see clearer performance benefits from multi-GPU setups.
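To make the comparison concrete, here is a small hedged sketch of the arithmetic described above (the 1000-step baseline is an illustrative value, not a sample default; only the per-GPU batch_size of 32 comes from the thread): to keep the total work constant, scale the step count down as GPUs are added.

```python
# Batch/step arithmetic for comparing 1-GPU and 2-GPU runs fairly.
per_gpu_batch_size = 32   # default batch_size mentioned above (per GPU)
base_steps = 1000         # illustrative step count for a single-GPU run

def steps_for(num_gpus):
    """Steps needed so every configuration processes the same number of samples."""
    total_samples = per_gpu_batch_size * base_steps    # single-GPU baseline
    global_batch = per_gpu_batch_size * num_gpus       # samples processed per step
    return total_samples // global_batch

for gpus in (1, 2):
    print(f"{gpus} GPU(s): global batch {per_gpu_batch_size * gpus}, "
          f"{steps_for(gpus)} steps for the same total work")
# 1 GPU(s): global batch 32, 1000 steps for the same total work
# 2 GPU(s): global batch 64, 500 steps for the same total work
```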
The current source files have a size of 32x32 pixels. Can I use source files with a resolution of 64x64? Should I change parameters or modify the SqueezeNet code to support this resolution? I would like to test the performance of the cards.
2.a) Should I enable or disable CrossFire?
2.b) Should I set GPU Workload to Compute or Graphics?
WARNING:tensorflow:From C:\Users\yuriy\source\repos\PythonApplication1\PythonApplication1\env\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 60000 samples
2020-11-15 19:13:48.169040: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectML70b4b8b341c8bda5dc82ecd28a29d918c28282b0.dll
2020-11-15 19:13:48.243361: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:132] DirectML device enumeration: found 2 compatible adapters.
2020-11-15 19:13:48.244920: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-11-15 19:13:48.246223: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:114] DirectML: creating device on adapter 0 (Radeon RX 570 Series)
2020-11-15 19:13:48.411156: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:114] DirectML: creating device on adapter 1 (Intel(R) HD Graphics 530)