
Question about performance #27

Open
romix opened this issue Apr 25, 2018 · 23 comments

romix commented Apr 25, 2018

With #26 fixed on my side, I was finally able to run some benchmarks. The libDNN-generated convolutions are about 3x-4x faster than my naive kernel described in #26, which is very nice! But they are slower than my convolutions running on the CPU :-(

The CPU implementation is single-threaded, uses NHWC for the input and uses the following filter layout:
filter = [depth/N, filter, filter, channel, N], where N is 8. This is done to make access to the filter more cache-friendly. As far as I understand, the following TVM trick uses a similar approach: http://tvmlang.org/2018/01/16/opt-mali-gpu.html (see tiling and packing). A small indexing sketch is below.
WDYT about this kind of layout optimization? Have you played with something like this? Do you think it may result in even faster convolution kernels?
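
For illustration, here is a minimal sketch of how an element of the packed filter would be addressed. This is not the actual implementation; the function and parameter names are made up for the example:

```cpp
// Illustrative indexing into the packed filter layout [depth/N, kh, kw, channel, N], N = 8.
// "depth" = number of output channels, "channel" = number of input channels.
#include <cstddef>

constexpr int N = 8;

inline float packedFilterAt(const float *packed, int outC, int kh, int kw, int inC,
                            int kernelH, int kernelW, int inChannels) {
  const int tile = outC / N;  // which group of N output channels
  const int lane = outC % N;  // position within the group
  const std::size_t idx =
      (((static_cast<std::size_t>(tile) * kernelH + kh) * kernelW + kw) * inChannels + inC) * N + lane;
  return packed[idx];
}
```

For a fixed (kh, kw, channel) position the N output-channel values are adjacent in memory, which is what makes the layout cache-friendly and easy to vectorize.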

BTW, I'm testing on a MacBook Pro 2017 using the AMD GPU.


naibaf7 commented Apr 25, 2018

It could be possible to add different layouts as an option. I haven't had time to look into it so far.
Performance-wise, on ImageNet, my Vega Frontier Edition using a batch size of 1024 gets close to the performance of a GTX 1080 now. Have you tried different batch sizes? Is there some breaking point where the GPU performance overtakes CPU performance? I can't imagine that for reasonable work sizes, the CPU would be faster.
Have you also checked it's actually using the AMD GPU on OpenCL, and not picking up either the CPU or the iGPU with OpenCL? What does your clinfo say?


romix commented Apr 25, 2018

Have you also checked it's actually using the AMD GPU on OpenCL

Yes, I checked. It is using AMD Radeon Pro 555 Compute_Engine.

Have you tried different batch sizes? Is there some breaking point where the GPU performance overtakes CPU performance?

For batch sizes 1 and 2, the CPU version is faster. Starting with batch size 3 and above, the libDNN versions are faster. But, interestingly enough, they are not dramatically faster. E.g. with a batch size of 30, the GPU version is only about 1.2x faster than the CPU version.


naibaf7 commented Apr 25, 2018

How do you allocate the memory for filter, input and output?
You can also check what percentage of your GPU's peak performance you are utilizing (see the sketch below).
The Radeon Pro 555 should be around 1.3 TFlops with 80 GB/s of memory bandwidth.
If the CPU is an i7-7700HQ, I'd expect 0.2-0.3 TFlops on a GEMM convolution. So even if the LibDNN convolution is a worst-case setup, I'd expect at least 0.6 TFlops from that GPU, which should in effect make the network at least twice as fast as on your CPU (using all cores for GEMM).
But for that you'd have to make sure the whole network runs on the GPU (no memory transfers before/after operations) and that all memory is allocated pinned on the GPU (no buffers allocated pinned in CPU memory and then used on the GPU).
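
To estimate the utilization, you can count the multiply-accumulates of a layer and divide by the measured time. A rough hand-rolled sketch (the layer parameters here are made up for the example):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Example layer: 3x3 convolution, 256 -> 256 channels, 14x14 output, batch 16.
  const std::int64_t batch = 16, outC = 256, inC = 256, outH = 14, outW = 14, kH = 3, kW = 3;
  const double seconds = 0.010;    // measured kernel time for this layer
  const double peakTflops = 1.3;   // e.g. Radeon Pro 555

  const double flops = 2.0 * batch * outC * outH * outW * inC * kH * kW;  // 2 = mul + add
  const double achievedTflops = flops / seconds / 1e12;
  std::printf("achieved %.3f TFlops, %.1f%% of peak\n",
              achievedTflops, 100.0 * achievedTflops / peakTflops);
  return 0;
}
```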

But you're correct, a Radeon Pro 555 will not be dramatically faster than a CPU using all cores. The theoretical limit is about 4 times faster, in practice probably more like 2 times.


romix commented Apr 25, 2018

How do you allocate the memory for filter, input and output?

All the buffers are allocated in advance on the device. There is no dynamic memory allocation happening during the run.

To be more precise, I compute the total size of the required memory for all buffers in advance, allocate one huge buffer big enough to hold that amount, and then assign each concrete buffer an offset inside this region. There is some simple logic at runtime to map (cl_mem address of the huge buffer, offset of the concrete buffer) to the cl_mem address of the concrete buffer.
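
For illustration, one way to do that mapping is with clCreateSubBuffer. This is just a sketch of the idea rather than my actual code, and it assumes each offset is aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN (reported in bits):

```cpp
#include <CL/cl.h>
#include <cstddef>

// Carve a concrete buffer out of the pre-allocated arena at a given byte offset.
cl_mem makeSubBuffer(cl_mem arena, std::size_t offsetBytes, std::size_t sizeBytes) {
  cl_buffer_region region;
  region.origin = offsetBytes;  // byte offset into the big arena buffer
  region.size = sizeBytes;      // size of the concrete buffer
  cl_int err = CL_SUCCESS;
  cl_mem sub = clCreateSubBuffer(arena, CL_MEM_READ_WRITE,
                                 CL_BUFFER_CREATE_TYPE_REGION, &region, &err);
  return (err == CL_SUCCESS) ? sub : nullptr;
}
```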

On CPU, all the buffers are allocated in advance as well in a similar way.

And as I mentioned, the network is running on a CPU using just a single core.


naibaf7 commented Apr 25, 2018

Can I see the implementation? Can you give me the convolution parameters and how long it takes to compute it? When timing a single OpenCL operation, make sure to flush and finish the OpenCL queue before stopping the timer.
If you use a single CPU core, which convolution method do you use? Naive, or BLAS-based (MKL? An Apple-specific BLAS library?)
I need a bit more info to get to the bottom of this issue :) A minimal timing sketch is below.
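
Something along these lines (error checking omitted); the important part is the clFinish before stopping the timer, because clEnqueueNDRangeKernel returns before the kernel has actually executed:

```cpp
#include <CL/cl.h>
#include <chrono>

double timeKernelMs(cl_command_queue queue, cl_kernel kernel, cl_uint dims,
                    const size_t *globalSize, const size_t *localSize) {
  clFinish(queue);  // drain previously enqueued work so it doesn't leak into the measurement
  const auto start = std::chrono::steady_clock::now();
  clEnqueueNDRangeKernel(queue, kernel, dims, nullptr, globalSize, localSize,
                         0, nullptr, nullptr);
  clFinish(queue);  // block until the kernel has completed
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}
```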


romix commented Apr 25, 2018

I test with the whole ResNet-50 and run multiple iterations of it to make sure that the first run, which involves e.g. kernel compilation, does not influence the picture too much.

If you use a single CPU core, which convolution method do you use? Naive, or BLAS-based (MKL? An Apple-specific BLAS library?)

I do not use any BLAS-based methods. It is a direct convolution. As I mentioned in the first message of this issue, the CPU implementation is single-threaded, uses NHWC for the input and uses the following filter layout: filter = [depth/N, filter, filter, channel, N], where N is 8.

Can I see the implementation?

Not yet, but rather soon ;-) Most likely in 2 weeks or a month.

Can you give me the convolution parameters and how long it takes to compute it? When timing a single OpenCL operation, make sure to flush and finish the OpenCL queue before stopping the timer.

I'll try to collect more precise measurements.


naibaf7 commented Apr 25, 2018

Not yet, but rather soon ;-) Most likely in 2 weeks or a month.

Ok, the more I know the better I can help with performance-related questions.
Alternatively, I suggest you try OpenCL Caffe performance on ResNet-50 as a comparison point:
https://github.com/BVLC/caffe/tree/opencl
(compile with ViennaCL, CLBlast and LibDNN enabled).


romix commented Apr 25, 2018

@naibaf7 I tried to build https://github.com/BVLC/caffe/tree/opencl locally. Now I'm trying to figure out how to run ResNet-50 with it. I have some existing Python scripts doing that, but they use the Caffe2 Python APIs (e.g. workspace), which are not available in this fork of Caffe.

Could you suggest how I can run the existing pre-trained models from the downloaded pb files?


naibaf7 commented Apr 25, 2018

ResNet-50 can be obtained here, among other places:
https://github.com/cvjena/cnn-models/tree/master/ResNet_preact/ResNet50_cvgj
You can do a precise layer-wise benchmark like this:
./build/tools/caffe time -lt -gpu=0 -iterations=5 -model deploy.prototxt
You should also edit the file to increase the batch size (on line 8; see the snippet below).
You can check which devices are present with ./build/tools/caffe device_query
Adjust the gpu parameter accordingly. The -lt flag shows you per-layer speeds, but will slow down the whole network a bit (due to flushing OpenCL command queues).
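
The batch size is the first dim of the input blob at the top of deploy.prototxt. The exact syntax depends on the file, but it looks roughly like this (newer Input-layer style shown; older files use input_dim: lines instead):

```
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 16 dim: 3 dim: 224 dim: 224 } }  # first dim = batch size
}
```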


romix commented Apr 25, 2018

@naibaf7 Thanks! I downloaded these models. And I'm able to run them.

BTW, if I run on the OpenCL CPU backend, libDNN aborts:

ViennaCL: FATAL ERROR: Kernel start failed for 'conv_forward'.
libc++abi.dylib: terminating with uncaught exception of type viennacl::ocl::invalid_work_group_size: ViennaCL: FATAL ERROR: CL_INVALID_WORK_GROUP_SIZE
 The supplied work group size is invalid. If you have set this value manually, please reconsider your choice.
If you think that this is a bug in ViennaCL, please report it at [email protected] and supply at least the following information:
 * Operating System
 * Which OpenCL implementation (AMD, NVIDIA, etc.)
 * ViennaCL version
Many thanks in advance!
*** Aborted at 1524697855 (unix time) try "date -d @1524697855" if you are using GNU date ***
PC: @     0x7fff695e8b6e __pthread_kill
*** SIGABRT (@0x7fff695e8b6e) received by PID 34004 (TID 0x7fffa1ac5380) stack trace: ***
    @     0x7fff697a6f5a _sigtramp
    @            0x20008 (unknown)
    @     0x7fff695441ae abort
    @     0x7fff67448f8f abort_message
    @     0x7fff67449113 default_terminate_handler()
    @     0x7fff68880eab _objc_terminate()
    @     0x7fff674647c9 std::__terminate()
    @     0x7fff6746426f __cxa_throw
    @        0x1013a1dfa viennacl::ocl::error_checker<>::raise_exception()
    @        0x101441499 viennacl::ocl::enqueue<>()
    @        0x101460bd6 caffe::LibDNNConv<>::Forward()
    @        0x10157fd46 caffe::LibDNNConvolutionLayer<>::Forward_gpu()
    @        0x10152df95 caffe::Layer<>::Forward()
    @        0x1015dcbba caffe::Net<>::ForwardFromTo()
    @        0x1015dcaa2 caffe::Net<>::Forward()
    @        0x10136cf1c time()
    @        0x10136e7c1 main
    @     0x7fff69498015 start
    @                0x6 (unknown)
Abort trap: 6

This is most likely due to the fact that Apple's OpenCL on the CPU reports max workgroup sizes like (1024, 1, 1), while the libDNN-generated kernels require at least 16 along the second and third dimensions. This is not very important, as running on the CPU is probably a bad idea, but maybe you want to fix it so that libDNN can run on any OpenCL implementation.

Now, if I select the AMD Radeon Pro 555 as the GPU device, the model runs properly, but I get the following exception when the program terminates:

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
*** Aborted at 1524698125 (unix time) try "date -d @1524698125" if you are using GNU date ***
PC: @     0x7fff695e8b6e __pthread_kill
*** SIGABRT (@0x7fff695e8b6e) received by PID 35242 (TID 0x7fffa1ac5380) stack trace: ***
    @     0x7fff697a6f5a _sigtramp
    @        0x100020008 (unknown)
    @     0x7fff695441ae abort
    @     0x7fff67448f8f abort_message
    @     0x7fff67449113 default_terminate_handler()
    @     0x7fff68880eab _objc_terminate()
    @     0x7fff674647c9 std::__terminate()
    @     0x7fff67464843 std::terminate()
    @        0x108947766 boost::thread_specific_ptr<>::delete_data::operator()()
    @        0x108f0722e boost::detail::set_tss_data()
    @        0x108947865 boost::thread_specific_ptr<>::~thread_specific_ptr()
    @     0x7fff69544eed __cxa_finalize_ranges
    @     0x7fff695451fe exit
    @     0x7fff6949801c start
    @                0x6 (unknown)
Abort trap: 6

Again, it is not very important, because this happens after the run and after all the stats are printed.

I don't know how to set the batch size for Caffe from the command-line, so I assume it uses the default of 1.

The Caffe stats on the AMD GPU look like:

I0425 16:19:04.951920 2712425344 caffe.cpp:469] Average Forward pass: 302.822 ms.
I0425 16:19:04.951939 2712425344 caffe.cpp:471] Average Backward pass: 432.58 ms.
I0425 16:19:04.951949 2712425344 caffe.cpp:473] Average Forward-Backward: 744.575 ms.

Caffe stats for running on CPU:

I0425 16:28:35.998687 2712425344 caffe.cpp:469] Average Forward pass: 119.97 ms.
I0425 16:28:35.998692 2712425344 caffe.cpp:471] Average Backward pass: 92.6422 ms.
I0425 16:28:35.998697 2712425344 caffe.cpp:473] Average Forward-Backward: 213.1 ms.

My own ResNet-50 on the same GPU needs 136 ms for the forward pass.
And my ResNet-50 on the CPU (no OpenCL) needs 75 ms for the forward pass.

So, overall, my implementations seem to be about 2x faster than Caffe.

We also see that on small batch sizes the GPU versions are slower than their CPU counterparts (both in Caffe and in my implementations).


naibaf7 commented Apr 25, 2018

I've got the following performance numbers (batch size 16, average forward pass):

  • GTX 1080 @ CUDA+cuBLAS+cuDNN: 87.29 ms (5.45 ms per image)
  • GTX 1080 @ OpenCL+CLBlast+LibDNN: 132.053 ms (8.25 ms per image)
  • AMD Vega FE 16 GB @ OpenCL+CLBlast+LibDNN: 111.967 ms (6.99 ms per image)
  • AMD RX 480 8 GB @ OpenCL+CLBlast+LibDNN: 191.716 ms (11.98 ms per image)

Additionally, the AMD Vega FE 16 GB can handle a batch size of 32; then we get:

  • AMD Vega FE 16 GB @ OpenCL+CLBlast+LibDNN: 198.288 ms (6.20 ms per image)

If the batch size is too small (1) we get:

  • AMD Vega FE 16 GB @ OpenCL+CLBlast+LibDNN: 38.366 ms (38.366 ms per image)
    (which is a 6.18x drop in per-image throughput, quite substantial)

The theoretical flops for these are:

  • GTX 1080: 9 TFlops
  • AMD Vega FE: 13 TFlops
  • AMD RX 480: 5.8 TFlops
  • AMD Radeon Pro 555: 1.3 TFlops

As you can see, OpenCL+LibDNN catches up to cuDNN surprisingly well with sufficient work sizes. Please test again using a larger batch size than 1. With batch size 1, the parallelism of GPUs can't quite come into effect. Just change the 1 on line 8 of deploy.prototxt to 4, 8 or 16.

According to the TFlops numbers (scaling the RX 480 time above by 5.8/1.3), we'd expect to see about 855.35 ms on your Radeon Pro 555, or about 53.45 ms per image. That would indeed be faster than both your CPU and GPU implementations. Now, since it only has 2 GB of RAM, a batch size of 16 is probably too big, but 8 or 4 should work.
It's very odd that instead of ~53.45 ms per image you get roughly 6 times worse throughput on your GPU. As you can see above, a ~6x slowdown is what my Vega FE shows on too-small work sizes as well.

Maybe the first iteration also causes the number to go up due to compiling LibDNN kernels?

I0426 01:47:38.602185 36273 caffe.cpp:465] Iteration: 1 forward-backward time: 7006.18 ms.
I0426 01:47:39.855995 36273 caffe.cpp:465] Iteration: 2 forward-backward time: 1253.74 ms.

If that's the case, then choosing more iterations to average it out (e.g. 50 iterations) would show more accurate numbers:
./build/tools/caffe time -lt -gpu=0 -iterations=50 -model deploy.prototxt

Also, thanks for the heads-up about the Apple OpenCL implementation. If they indeed only allow (1024, 1, 1) and no symmetric ranges on the first and second dimensions, then LibDNN will perform abysmally on it, and I have no intention of fixing that. Even the Raspberry Pi VideoCore can handle normal sizes like (3,3,1). Apple needs to fix this.

I assume you use the same image size in your ResNet-50 as in Caffe? For Caffe, the image size (spatial dimensions) used here was 224x224.


romix commented Apr 26, 2018

OK. Here are the numbers (batch size 16, average forward pass, 10 iterations):

  • AMD Radeon Pro 555 Caffe OpenCL + CLBlast + LibDNN: 2655.84 ms
  • Caffe CPU: 1960.38 ms
  • AMD Radeon Pro 555 My OpenCL impl + LibDNN: 1196 ms
  • My impl CPU: 1366 ms

For batch size of 32:

  • AMD Radeon Pro 555 Caffe OpenCL + CLBlast + LibDNN: 8391.22 ms
  • Caffe CPU: 7973.71 ms
  • AMD Radeon Pro 555 My OpenCL impl + LibDNN: 2314 ms
  • My impl CPU: 2604 ms

I wonder why OpenCL + LibDNN starts to win against the CPU with increasing batch sizes in my implementations, but improves much more slowly in the case of Caffe. Also, my implementations seem to scale linearly with the batch size, but the Caffe implementations do not seem to follow the same pattern (e.g. they become about 3x-4x slower when the batch size goes from 16 to 32).

Note:
I used the following pre-trained models to run my implementation
https://github.com/caffe2/models/tree/master/resnet50

I assume they are equivalent to the ones from https://github.com/cvjena/cnn-models/tree/master/ResNet_preact/ResNet50_cvgj


romix commented Apr 26, 2018

Also, thanks for the heads-up about the Apple OpenCL implementation. If they indeed only allow (1024, 1, 1) and no symmetric ranges on the first and second dimensions, then LibDNN will perform abysmally on it, and I have no intention of fixing that. Even the Raspberry Pi VideoCore can handle normal sizes like (3,3,1). Apple needs to fix this.

While I agree with you, I don't believe they'll fix it. It seems like Apple has more or less abandoned any further OpenCL development and concentrated on its Metal APIs. As for the (1024, 1, 1) limitation, I don't remember if it is always the case or if it kicks in only when a kernel uses memory fencing like barrier(CLK_LOCAL_MEM_FENCE);. There is an OpenCL API to get the max workgroup dimensions for a specific kernel, and this is where it returns these strange limits.
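
For reference, a rough sketch of the two queries involved (the device-wide per-dimension limits come from clGetDeviceInfo, the per-kernel total limit from clGetKernelWorkGroupInfo; error checking omitted):

```cpp
#include <CL/cl.h>
#include <cstdio>

void printWorkGroupLimits(cl_device_id device, cl_kernel kernel) {
  size_t maxItemSizes[3] = {0, 0, 0};  // per-dimension limits, e.g. (1024, 1, 1) on Apple's CPU device
  clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                  sizeof(maxItemSizes), maxItemSizes, nullptr);

  size_t kernelWgSize = 0;  // maximum total work-group size for this particular kernel
  clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                           sizeof(kernelWgSize), &kernelWgSize, nullptr);

  std::printf("device max work-item sizes: (%zu, %zu, %zu); kernel max work-group size: %zu\n",
              maxItemSizes[0], maxItemSizes[1], maxItemSizes[2], kernelWgSize);
}
```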

I assume you use the same image size in your ResNet-50 as in Caffe? For Caffe, the image size (spatial dimensions) used here was 224x224.

Yes, of course.


naibaf7 commented Apr 26, 2018

One factor could be memory allocation and total memory use. In the case of Caffe, batch size 32 certainly would not fit into the on-GPU memory (2 GB) of your Radeon Pro 555.
I'll look into the details of why all that happens.
It's certainly weird that the networks scale much better on my hardware than on yours. I don't have (this is funny, isn't it...) such low-end parts as you have to test with, though.


romix commented Apr 26, 2018

If that's the case, then choosing more iterations to average it out (e.g. 50 iterations) would show more accurate numbers:

I tried with more iterations, but get more or less the same numbers.


romix commented Apr 26, 2018

Batch size 32 certainly would not fit into the on-GPU memory (2 GB) of your Radeon Pro 555.

My implementation says it needs 805516800 bytes (about 768.2 MB) for it. So it fits into 2 GB.


romix commented Apr 26, 2018

It's certainly weird that the networks scale much better on my hardware than on yours. I don't have (this is funny, isn't it...) such low-end parts as you have to test with, though.

I suspect Apple's OpenCL implementation. Maybe it is too old or something like that... I really need to test this on more high-end, non-Apple hardware.


naibaf7 commented Apr 26, 2018

@romix That's very little for ResNet-50. Do you re-use the buffers during inference-only forward passes?
See this: https://github.com/beniz/deepdetect/issues/84
According to that, 7341467648 B (approx. 6.8 GB) is required to train ResNet-50 on Caffe. That would explain why you experience a slowdown when using such large batch sizes on Caffe, because memory would spill over into system memory.
I have a reduced-memory branch of Caffe for inference here, but it's not published yet. I'll notify you when it is.


romix commented Apr 26, 2018

Yes, my implementation performs very aggressive buffer re-use during inference-only forward passes.


romix commented May 21, 2018

@naibaf7 Regarding efficient convolution on a CPU: you asked if you could see the implementation so that you can compare.

Please have a look at the following function: https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/libjit/libjit_conv.cpp#L233-L262

As I mentioned before, this CPU convolution implementation outperforms the GPU implementations produced by libDNN on batches of size 1 and 2. If we increase the batch size, the libDNN-generated kernels start beating the CPU version, and their lead slowly grows with the batch size.
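
For readers without the link handy, here is a schematic of the idea. This is not the glow code itself, just an illustration of a direct NHWC convolution over the packed filter layout (it assumes outC is a multiple of N):

```cpp
// Schematic direct convolution: NHWC input, packed filter [outC/N][kh][kw][inC][N], N = 8.
#include <cstddef>

constexpr int N = 8;

void convNHWC(const float *in, const float *packedFilter, const float *bias, float *out,
              int batch, int inH, int inW, int inC, int outC, int kH, int kW,
              int stride, int pad, int outH, int outW) {
  for (int n = 0; n < batch; ++n)
    for (int oy = 0; oy < outH; ++oy)
      for (int ox = 0; ox < outW; ++ox)
        for (int ocTile = 0; ocTile < outC / N; ++ocTile) {
          float acc[N];
          for (int l = 0; l < N; ++l) acc[l] = bias[ocTile * N + l];
          for (int ky = 0; ky < kH; ++ky)
            for (int kx = 0; kx < kW; ++kx) {
              const int iy = oy * stride - pad + ky, ix = ox * stride - pad + kx;
              if (iy < 0 || iy >= inH || ix < 0 || ix >= inW) continue;  // zero padding
              const float *inPx = in + ((static_cast<std::size_t>(n) * inH + iy) * inW + ix) * inC;
              const float *flt = packedFilter +
                  ((static_cast<std::size_t>(ocTile) * kH + ky) * kW + kx) * inC * N;
              for (int ic = 0; ic < inC; ++ic)
                for (int l = 0; l < N; ++l)  // the N filter values per input channel are contiguous
                  acc[l] += inPx[ic] * flt[ic * N + l];
            }
          float *outPx = out + ((static_cast<std::size_t>(n) * outH + oy) * outW + ox) * outC + ocTile * N;
          for (int l = 0; l < N; ++l) outPx[l] = acc[l];
        }
}
```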


naibaf7 commented May 21, 2018

Thanks, I'll check it out :)
Btw. there's now also aggressive buffer reuse as an optional flag for networks in Caffe.


romix commented May 21, 2018

Btw. there's now also aggressive buffer reuse as an optional flag for networks in Caffe.

Which commit is it? After a quick look, I could not find it.


naibaf7 commented May 21, 2018

naibaf7/caffe@411defe
Not in mainline Caffe yet, sorry. But hopefully soon.
