Question about performance #27
It could be possible to add different layouts as an option. I haven't had time to look into it so far.
Yes, I checked. It is using AMD Radeon Pro 555 Compute_Engine.
On batches of size 1 and 2, the CPU version is faster. Starting with batch size 3 and above, the libDNN versions are faster. But, interestingly enough, they are not dramatically faster. E.g. with a batch of size 30, the GPU version is only about 1.2x faster than the CPU version.
How do you allocate the memory for filter, input and output? But you're correct, a Radeon Pro 555 will not be dramatically faster than a CPU using all cores. The theoretical limit is about 4 times faster, in practice probably more like 2 times.
All the buffers are allocated in advance on the device. There is no dynamic memory allocation happening during the run. To be more precise, I compute the total size of the required memory for all buffers in advance, allocate one huge buffer big enough to hold that amount, and then assign each concrete buffer an offset inside this region. There is some simple logic at runtime to map (cl_mem address of the huge buffer, offset of the concrete buffer) to the cl_mem address of the concrete buffer. On the CPU, all buffers are allocated in advance in a similar way. And as I mentioned, the network is running on the CPU using just a single core.
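A minimal sketch of that kind of pre-planned allocation, assuming the offsets are turned into concrete cl_mem handles with clCreateSubBuffer; the names (TensorInfo, planBuffers, concreteBuffer) are illustrative and not taken from the actual implementation:

```cpp
#include <CL/cl.h>
#include <cstddef>
#include <vector>

struct TensorInfo {
  size_t size = 0;    // bytes required by this tensor
  size_t offset = 0;  // byte offset inside the arena, filled in by planBuffers
};

// Assign each tensor an offset inside one contiguous region and return the
// total arena size. Offsets are aligned so they can back cl_mem sub-buffers.
static size_t planBuffers(std::vector<TensorInfo> &tensors, size_t align) {
  size_t total = 0;
  for (auto &t : tensors) {
    total = (total + align - 1) / align * align;  // round up to alignment
    t.offset = total;
    total += t.size;
  }
  return total;
}

// Map (arena, offset, size) to a concrete cl_mem via a sub-buffer. The offset
// must satisfy the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN requirement.
static cl_mem concreteBuffer(cl_mem arena, size_t offset, size_t size,
                             cl_int *err) {
  cl_buffer_region region = {offset, size};
  return clCreateSubBuffer(arena, CL_MEM_READ_WRITE,
                           CL_BUFFER_CREATE_TYPE_REGION, &region, err);
}
```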
Can I see the implementation? Can you give me the convolution parameters and how long it takes to compute it? Make sure, when timing a single OpenCL operation, to use the appropriate commands to flush the OpenCL queue before stopping the timer.
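For reference, a minimal sketch of timing a single enqueued kernel with the queue flushed around the measured region; the function name and launch dimensions are placeholders, not from any particular code base:

```cpp
#include <CL/cl.h>
#include <chrono>
#include <cstdio>

void timeKernel(cl_command_queue queue, cl_kernel kernel,
                const size_t global[3], const size_t local[3]) {
  clFinish(queue);  // drain any previously enqueued work first
  auto start = std::chrono::high_resolution_clock::now();
  clEnqueueNDRangeKernel(queue, kernel, 3, nullptr, global, local,
                         0, nullptr, nullptr);
  clFinish(queue);  // wait for the kernel to actually finish
  auto end = std::chrono::high_resolution_clock::now();
  double ms = std::chrono::duration<double, std::milli>(end - start).count();
  std::printf("kernel took %.3f ms\n", ms);
}
```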
I do test with the whole resnet50 and I run multiple iterations of it to make sure that the first run, which involves e.g. kernel compilation, does not influence the picture too much.
I do not use any BLAS-based methods. It is a direct convolution. As I mentioned in the first message of this issue, the CPU implementation is single-threaded, uses NHWC for the input, and uses the filter layout described there: filter = [depth/N, filter, filter, channel, N], where N is 8.
Not yet, but rather soon ;-) Most likely in 2 weeks or a month.
I'll try to collect more precise measurements.
Ok, the more I know, the better I can help with performance-related questions.
@naibaf7 I tried to build https://github.com/BVLC/caffe/tree/opencl locally. Now I'm trying to figure out how to run the resnet50 with it. I have some existing python scripts doing it, but they use Caffe2 python APIs (e.g. workspace) which are not available in this fork of Caffe. Could you suggest how I can run the existing pre-trained models by providing the downloaded pb files?
Resnet50 can be obtained here, among other places:
@naibaf7 Thanks! I downloaded these models. And I'm able to run them. BTW, if I run on the OpenCL CPU backend, libDNN asserts:
This is most likely due to the fact that Apple's OpenCL on the CPU has max workgroup sizes like (1024, 1, 1), while libDNN-generated kernels require at least 16 along the second and third dimensions. This is not very important, as running on the CPU is probably a bad idea, but maybe you want to fix it so that libDNN can run on any OpenCL implementation. Now, if I select the AMD Radeon Pro 555 as the GPU device, the model runs properly, but I get the following exception when the program terminates:
Again, it is not very important, because this happens after the run and after all the stats are printed. I don't know how to set the batch size for Caffe from the command-line, so I assume it uses the default of 1. The Caffe stats on the AMD GPU look like:
Caffe stats for running on CPU:
My own resnet50 on the same GPU needs 136 ms for the forward pass. So, overall, my implementations seem to be about 2x faster than Caffe. We also see that at small batch sizes the GPU versions are slower than their CPU counterparts (both in Caffe and in my implementations).
I've got the following performance numbers (batch size 16, average forward pass):
Additionally, the AMD Vega FE 16 GB can handle a batch size of 32, and then we get:
If the batch size is too small (1), we get:
The theoretical flops for these are:
As you can see, OpenCL+LibDNN comes surprisingly close to catching up with cuDNN given sufficient work sizes. Please test again using a larger batch size than 1. With a batch size of 1, the parallelism of GPUs can't quite come into effect. Just replace the 1 on line 8 of the deploy.txt with 4, 8 or 16. According to the TFlops numbers, we'd expect to see 855.35 ms on your RX 555, i.e. about 53.45 ms per image. That would indeed be faster than your CPU and GPU implementations. Now, since it only has 2 GB of RAM, a batch size of 16 is probably too big, but 8 or 4 should work. Maybe the first iteration also causes the number to go up due to compiling the LibDNN kernels?
If that's the case, then choosing more iterations to average it out (e.g. 50 iterations) would show more accurate numbers. Also, thanks for the heads-up about the Apple OpenCL implementation. If they indeed only allow (1024, 1, 1) and no symmetric ranges on the first and second dimensions, then LibDNN will perform abysmally on it, and I have no intention of fixing that. Even the Raspberry Pi VideoCore can handle normal sizes like (3, 3, 1). Apple needs to fix this. I assume you use the same image size in your ResNet50 as in Caffe? For Caffe, the image size (spatial dimensions) used here was 224x224.
OK. Here are the numbers (batch size 16, average forward pass, 10 iterations):
For a batch size of 32:
I wonder why OpenCL + LibDNN starts to win against the CPU with increasing batch sizes in my implementations, but improves much more slowly in the case of Caffe. Also, my implementations seem to scale linearly with the batch size, but the Caffe implementations do not seem to follow the same pattern (e.g. they are about 3x-4x slower when the batch size changes from 16 to 32). Note: I assume they are equivalent to the ones from https://github.com/cvjena/cnn-models/tree/master/ResNet_preact/ResNet50_cvgj
While I agree with you, I don't believe they'll fix it. It seems like Apple has more or less abandoned any further OpenCL development and concentrated on its Metal APIs. As for the (1024, 1, 1) limitation, I don't remember if it is always the case or if it only kicks in when a kernel uses memory fencing like OpenCL barriers.
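For what it's worth, a small sketch of how these limits can be queried directly from the device (assuming a cl_device_id obtained via clGetDeviceIDs, and a device that reports three work-item dimensions):

```cpp
#include <CL/cl.h>
#include <cstdio>

void printWorkGroupLimits(cl_device_id device) {
  size_t maxSizes[3] = {0, 0, 0};  // assumes 3 work-item dimensions
  size_t maxTotal = 0;
  clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(maxSizes),
                  maxSizes, nullptr);
  clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxTotal),
                  &maxTotal, nullptr);
  std::printf("max work-item sizes: (%zu, %zu, %zu), max work-group size: %zu\n",
              maxSizes[0], maxSizes[1], maxSizes[2], maxTotal);
}
```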
Yes, of course.
One factor could be memory allocation and total memory use. A batch size of 32 certainly would not fit into the on-GPU memory (2 GB) of your Radeon RX 555 in the case of Caffe.
I tried with more iterations, but get more or less the same numbers.
My implementation says it needs 805516800 bytes (about 768.2 MB) for it. So, it fits into 2 GB.
I suspect Apple's OpenCL implementation. Maybe it is too old or something like that... I really need to test it on more high-end, non-Apple hardware.
@romix That's very little for ResNet-50. Do you re-use the buffers during inference-only forward passes?
Yes, my implementation performs very aggressive buffer re-use during inference-only forward passes.
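A minimal sketch of one way such re-use can be planned, assuming live ranges are known up front: tensors whose live ranges do not overlap share the same offset in the pre-allocated region. The greedy first-fit strategy and all names here are illustrative, not the actual implementation:

```cpp
#include <cstddef>
#include <vector>

struct LiveTensor {
  size_t size = 0;    // bytes needed
  int firstUse = 0;   // index of the op that produces the tensor
  int lastUse = 0;    // index of the last op that reads it
  size_t offset = 0;  // assigned arena offset
};

// Greedy first-fit assignment: re-use a previously assigned region whenever
// its tensor is no longer live. Assumes `tensors` is sorted by firstUse.
// Returns the total arena size needed.
size_t assignOffsets(std::vector<LiveTensor> &tensors) {
  struct Region { size_t offset, size; int freeAfter; };
  std::vector<Region> regions;
  size_t total = 0;
  for (auto &t : tensors) {
    bool placed = false;
    for (auto &r : regions) {
      if (r.freeAfter < t.firstUse && r.size >= t.size) {
        t.offset = r.offset;
        r.freeAfter = t.lastUse;  // region is busy again until lastUse
        placed = true;
        break;
      }
    }
    if (!placed) {
      t.offset = total;
      regions.push_back({total, t.size, t.lastUse});
      total += t.size;
    }
  }
  return total;
}
```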
@naibaf7 Regarding the efficient convolution on a CPU: you asked if you could see the implementation, so that you can compare. Please have a look at the following function: https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/libjit/libjit_conv.cpp#L233-L262 As I mentioned before, this convolution implementation on the CPU outperforms the GPU implementations produced by libDNN on batches of size 1 and 2. If we increase the batch size, the libDNN-generated kernels start beating the CPU version, and their lead slowly increases with increasing batch sizes.
Thanks, I'll check it out :) |
Which commit is it? After a quick look, I could not find it. |
naibaf7/caffe@411defe |
With #26 fixed on my side, I was able to run some benchmarks now. The libDNN-generated convolutions are about 3x-4x faster than my naive kernel described in #26, which is very nice! But they are slower than my convolutions running on the CPU :-(
The CPU implementation is single-threaded, uses NHWC for the input and uses the following filter layout:
filter = [depth/N, filter, filter, channel, N]
where N is 8. This is done to make access to the filter more cache-friendly. As far as I understand, the following TVM trick uses a similar approach: http://tvmlang.org/2018/01/16/opt-mali-gpu.html (see tiling and packing). WDYT about this kind of layout optimization? Have you played with something like this? Do you think it may result in even faster convolution kernels?
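For concreteness, a minimal scalar sketch of a direct convolution indexing into that packed layout (stride 1, no padding, batch of 1, outC a multiple of N); the shapes and names are assumptions for illustration, not the actual kernel:

```cpp
#include <cstddef>

constexpr size_t N = 8;  // output channels packed together

// input:  NHWC   [1, inH, inW, inC]
// filter: packed [outC / N, fH, fW, inC, N]
// output: NHWC   [1, outH, outW, outC]
void convPackedFilter(float *out, const float *in, const float *filter,
                      size_t inH, size_t inW, size_t inC, size_t outC,
                      size_t fH, size_t fW, size_t outH, size_t outW) {
  (void)inH;  // kept only to document the input shape
  for (size_t d = 0; d < outC / N; ++d) {          // packed output-channel group
    for (size_t oy = 0; oy < outH; ++oy) {
      for (size_t ox = 0; ox < outW; ++ox) {
        float sum[N] = {0};
        for (size_t fy = 0; fy < fH; ++fy) {
          for (size_t fx = 0; fx < fW; ++fx) {
            for (size_t c = 0; c < inC; ++c) {
              float v = in[((oy + fy) * inW + (ox + fx)) * inC + c];
              // The N filter values for this (fy, fx, c) are contiguous,
              // so the inner loop maps naturally onto a SIMD register.
              const float *f =
                  &filter[(((d * fH + fy) * fW + fx) * inC + c) * N];
              for (size_t i = 0; i < N; ++i)
                sum[i] += v * f[i];
            }
          }
        }
        for (size_t i = 0; i < N; ++i)
          out[(oy * outW + ox) * outC + d * N + i] = sum[i];
      }
    }
  }
}
```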
BTW, I'm testing on a MacBook Pro 2017 using the AMD GPU.