CrossFire & MultiGPU & GPU workload #90
Great question. I don't believe we've actually tried a CrossFire setup yet. I recall @adtsai tested SqueezeNet with two non-CrossFired GPUs. If I remember correctly, alternate training steps were executed on the two devices; however, this required an "enlightened" model deployment (see model_deploy.py). Adrian may recall more about this experiment.

There are some distributed execution strategies in TF that automatically assign work in an intelligent way. However, we haven't implemented this functionality yet, and it may be a while before we do. Our initial focus is on single-GPU performance, since that is what the majority of users will have. For now, my guess is that you'll get the best performance by treating the two cards as separate devices and having the model explicitly assign big chunks of work to each. Long term, we'd like to make this more automatic.

We can ask AMD directly about question 2b. We currently use a "direct" command queue instead of a compute-specific command queue, so setting "GPU Workload" to "Compute" in AMD settings may not make any difference. I'll get back to you on this.
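As a rough illustration of "treating the two cards as separate devices", here is a minimal sketch (not from the SqueezeNet sample) that pins one large piece of work to each adapter. It assumes the `/device:DML:0` and `/device:DML:1` device names mentioned below and the TF 1.15 graph-mode API that tensorflow-directml is built on:

```python
import tensorflow as tf  # tensorflow-directml is based on TF 1.15 (graph mode)

# Two large, independent chunks of work.
a = tf.random.uniform([4096, 4096])
b = tf.random.uniform([4096, 4096])

# Explicitly place one chunk on each DirectML adapter.
with tf.device('/device:DML:0'):
    c0 = tf.matmul(a, b)
with tf.device('/device:DML:1'):
    c1 = tf.matmul(b, a)

total = c0 + c1  # TF inserts the cross-device copies automatically

with tf.compat.v1.Session() as sess:
    sess.run(total)
```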
@YuriyTigiev: If the CrossFire option in the AMD control panel does what I think it does, I would recommend disabling it. TensorFlow actually expects the GPUs to appear as separate devices - distribution of work across the GPUs is handled by TF itself rather than the driver. Similarly, you should disable SLI if you happen to have NVIDIA GPUs. As @jstoecker mentioned, though, our support for distributed execution isn't 100% complete yet. That being said, if you un-crossfire your GPUs, both of them should be visible to TensorFlow (they should show up as e.g. /device:DML:0 and /device:DML:1). Our SqueezeNet sample does support multi-GPU training; it's just not enabled by default. If you'd like to test out multi-GPU, try the following: [...]
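As a side note (this is a hedged sketch, not the sample's own multi-GPU steps, which aren't listed here), one way to confirm that both adapters are exposed to TensorFlow after disabling CrossFire is to enumerate the local devices:

```python
# List the devices tensorflow-directml can see (TF 1.15 API).
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    print(d.name, d.physical_device_desc)

# With CrossFire/SLI disabled, you would expect /device:DML:0 and /device:DML:1
# to appear alongside /device:CPU:0.
```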
I'm sharing the results of testing SqueezeNet.

With CrossFire:

Without CrossFire:
It looks like you're only running [...]. There's also a script in the squeezenet sample, [...]. We also have the yolov3 sample if SqueezeNet isn't cutting it for your needs. AI Benchmark is another set of scripts you can try out for performance, and we're using it to help guide some of our upcoming perf work.
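If you want to give AI Benchmark a try, a minimal invocation looks roughly like the sketch below. This assumes the `ai-benchmark` PyPI package (not named explicitly above) installed into the same environment as tensorflow-directml:

```python
# Hedged sketch: run AI Benchmark against the installed TensorFlow build.
# Assumes: pip install ai-benchmark
from ai_benchmark import AIBenchmark

benchmark = AIBenchmark()
results = benchmark.run()  # runs the inference + training suite and prints device scores
```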
Result of SqueezeNet testing. Attached: crossfire_disabled_cifar_None32_NCHW_201118-230041.zip

CrossFire Disabled:

CrossFire Enabled:
Hi Yuriy, keep in mind that the results may not be an apples-to-apples comparison because of the way the model divides the batches between the GPUs. These scripts pick a default batch_size of 32, and this represents the minibatch size for each GPU. That is, if running on one GPU you're processing 32 samples per step; if running on two GPUs, you're processing 64 total samples per step (32 per GPU). (And remember that if you have CrossFire enabled, TensorFlow only sees the two cards as one GPU, not two.)

This means that, for a fixed number of training steps, you won't necessarily see a lower wall-clock time with more GPUs -- but you likely will see the model converge faster, because you're processing more total samples when you add more GPUs. One thing you could try to offset this is to set e.g. [...].

The other thing to keep in mind is that SqueezeNet is, naturally, quite a small model that only takes a few minutes to train. There's a certain amount of fixed synchronization overhead when using multiple GPUs, and for a small model like SqueezeNet (where a training step only takes a handful of milliseconds) this overhead can take up a large proportion of the total running time. With a larger model (or a much larger batch_size - say, 256 or 512) you should start to see clearer performance benefits from multi-GPU setups.
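To make the comparison concrete, here is a small hedged sketch of the arithmetic described above (the 1000-step baseline is an illustrative value, not a sample default; only the per-GPU batch_size of 32 comes from the thread): to keep the total work constant, scale the step count down as GPUs are added.

```python
# Batch/step arithmetic for comparing 1-GPU and 2-GPU runs fairly.
per_gpu_batch_size = 32   # default batch_size mentioned above (per GPU)
base_steps = 1000         # illustrative step count for a single-GPU run

def steps_for(num_gpus):
    """Steps needed so every configuration processes the same number of samples."""
    total_samples = per_gpu_batch_size * base_steps    # single-GPU baseline
    global_batch = per_gpu_batch_size * num_gpus       # samples processed per step
    return total_samples // global_batch

for gpus in (1, 2):
    print(f"{gpus} GPU(s): global batch {per_gpu_batch_size * gpus}, "
          f"{steps_for(gpus)} steps for the same total work")
# 1 GPU(s): global batch 32, 1000 steps for the same total work
# 2 GPU(s): global batch 64, 500 steps for the same total work
```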
The current source files have a size of 32x32 pixels. Can I use source files with a resolution of 64x64? Should I change parameters or modify the SqueezeNet code to support this resolution? I would like to test the performance of the cards.
2.a) Should I enable or disable CrossFire?
2.b) Should I set GPU Workload to Compute or Graphics?
WARNING:tensorflow:From C:\Users\yuriy\source\repos\PythonApplication1\PythonApplication1\env\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 60000 samples
2020-11-15 19:13:48.169040: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectML70b4b8b341c8bda5dc82ecd28a29d918c28282b0.dll
2020-11-15 19:13:48.243361: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:132] DirectML device enumeration: found 2 compatible adapters.
2020-11-15 19:13:48.244920: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-11-15 19:13:48.246223: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:114] DirectML: creating device on adapter 0 (Radeon RX 570 Series)
2020-11-15 19:13:48.411156: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:114] DirectML: creating device on adapter 1 (Intel(R) HD Graphics 530)