
integration to tiny-cnn #3

Open · edgarriba opened this issue Jul 18, 2016 · 100 comments

edgarriba (Contributor) commented Jul 18, 2016

@naibaf7 @bhack

I'm opening this ticket to discuss ideas for integrating libdnn into tiny-cnn.

Currently, I have implemented a small interface to get the native OpenCL context from tiny-cnn:
https://github.com/edgarriba/tiny-cnn/blob/f4d9e1d4f45ad8ac46824b5c007f5507fe8925eb/tiny_cnn/core/session.h
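
For illustration, a minimal sketch of what such a context holder could look like is below (the class and member names are hypothetical, not the actual tiny-cnn session.h API):

    // Hypothetical sketch of an OpenCL "session" that exposes the native context.
    #include <CL/cl.h>

    class ocl_session {
     public:
      explicit ocl_session(cl_device_id device) : device_(device) {
        cl_int err = CL_SUCCESS;
        // Create a context and queue for the chosen device; error handling omitted.
        context_ = clCreateContext(nullptr, 1, &device_, nullptr, nullptr, &err);
        queue_   = clCreateCommandQueue(context_, device_, 0, &err);
      }
      ~ocl_session() {
        clReleaseCommandQueue(queue_);
        clReleaseContext(context_);
      }
      cl_context       context() const { return context_; }  // handed to libdnn
      cl_command_queue queue()   const { return queue_; }
      cl_device_id     device()  const { return device_; }

     private:
      cl_device_id     device_;
      cl_context       context_;
      cl_command_queue queue_;
    };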

Things I think that are needed:

  • Implement a module for data transfer between devices
  • Discuss the shape of a simplified libdnn interface, if one is needed.

BTW, @naibaf7 @hughperkins, note that we are planning to migrate tiny-cnn to an organization account and to rename the library itself, since it's now more a pure DNN lib than just a CNN one. Maybe you are interested in getting more involved in the development. tiny-dnn/tiny-dnn#235

naibaf7 (Owner) commented Jul 18, 2016

@edgarriba
Ok, I'll look into it. What is your idea about the data transfer and simplified interface?

edgarriba (Contributor Author) commented Jul 18, 2016

@naibaf7
for memory syncing I was thinking of an approach similar to TF's. What do you think is the best way to start?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/copy_tensor.cc#L46
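
For illustration, a TF-style copy helper would dispatch on where source and destination live, roughly like the sketch below (hypothetical names and plain OpenCL calls; not tiny-cnn or TensorFlow code):

    // Hypothetical sketch: copy `bytes` between host memory and cl_mem buffers
    // depending on source/destination location (host-to-host via memcpy omitted).
    #include <CL/cl.h>
    #include <cstddef>

    enum class location { host, device };

    void copy_tensor(location src_loc, location dst_loc, cl_command_queue queue,
                     const void* src_host, void* dst_host,
                     cl_mem src_dev, cl_mem dst_dev, size_t bytes) {
      if (src_loc == location::host && dst_loc == location::device) {
        clEnqueueWriteBuffer(queue, dst_dev, CL_TRUE, 0, bytes, src_host,
                             0, nullptr, nullptr);
      } else if (src_loc == location::device && dst_loc == location::host) {
        clEnqueueReadBuffer(queue, src_dev, CL_TRUE, 0, bytes, dst_host,
                            0, nullptr, nullptr);
      } else if (src_loc == location::device && dst_loc == location::device) {
        clEnqueueCopyBuffer(queue, src_dev, dst_dev, 0, 0, bytes,
                            0, nullptr, nullptr);
      }
    }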

Regarding the simplified interface, it's just about making clear what type of data the libdnn interfaces will need.

bhack commented Jul 18, 2016

@naibaf7 What do you think of this design?

naibaf7 (Owner) commented Jul 18, 2016

@bhack
Not bad, but essentially no different from functions already existing in either the ViennaCL or the Caffe code.
It's really unfortunate that tiny-cnn can't handle any GPU-related management itself at the moment, as both cuDNN and libDNN expect proper GPU & memory initialization beforehand (and I modeled the libDNN interface closely on the cuDNN one).

bhack commented Jul 18, 2016

What do you want our responsibilities to be? Device, memory and context? Kernel launch?

naibaf7 (Owner) commented Jul 18, 2016

@bhack
If you want compatibility with typical GPU libraries, then optimally device, memory and context. Kernel launch is handled internally by both cuDNN and libDNN, because it needs tricky parameter selection.

But if you can wait 2 days, I will update libDNN with simplifications for the device, memory and context part (basically copied over from Caffe): just wrappers around CUDA and ViennaCL for memory allocation, device listing & initialization, and memory transfer.

bhack commented Jul 18, 2016

We have no problem handling device, memory and context. If you think it would be useful to have these features here, OK; if not, we will implement this in tiny. We need to think of the largest audience for this library. So if you think callers generally benefit from handling context, device and memory themselves for other kinds of operations, it's fine for us to implement that in tiny and go ahead with porting further kernel coverage here.

edgarriba (Contributor Author) commented:

@naibaf7
I'll leave the error report I get here: http://pastebin.com/yv6rmu21

naibaf7 (Owner) commented Jul 27, 2016

@edgarriba
What steps do you use to test the integration (just so I can replicate)...?
The error looks familiar from early OpenCL Caffe tests, so it should be something I can fix.

edgarriba (Contributor Author) commented Jul 27, 2016

With this branch: https://github.com/edgarriba/tiny-cnn/tree/libdnn

  cmake -DBUILD_TESTS=ON -DUSE_OPENCL=ON -DUSE_LIBDNN=ON -DUSE_TBB=ON ..
  make && ./test/tiny_cnn_test

Important routines:
https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/backend_dnn.h#L76
https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/kernels/libdnn_conv2d_kernel.h#L54

naibaf7 (Owner) commented Jul 28, 2016

@edgarriba
OK, I've gained a good understanding of what you are trying to do and what issues the code has. I filed a pull request on your tiny-cnn libdnn branch which fixes some of the issues and explains some details of the problems that still exist.

If you need more assistance in fixing these, or want to discuss it in more detail, we can schedule a Skype conversation with screen sharing and review the code that way.

hughperkins commented:

By the way, a random observation, not sure if this is useful or not: if you use the clew abstraction layer, you can build for OpenCL without libOpenCL.so being present, and you can even run your program without libOpenCL.so being present. Your program can make a runtime decision about whether to try binding to OpenCL or not. Actually, I think it can even attempt to call clewInit() and thus detect whether libOpenCL.so is present, and deal with it accordingly. (Not sure about this last sentence; it would be easy to fix if it's not actually the case.)
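
As a rough sketch of that runtime decision (assuming clew's clewInit() returns CLEW_SUCCESS when libOpenCL.so/OpenCL.dll could be loaded):

    // Sketch: decide at runtime whether an OpenCL runtime is available via clew.
    #include "clew.h"
    #include <cstdio>

    bool opencl_available() {
      // clewInit() tries to dlopen the OpenCL library and bind its symbols.
      return clewInit() == CLEW_SUCCESS;  // assumption: CLEW_SUCCESS on success
    }

    int main() {
      if (opencl_available()) {
        std::printf("OpenCL runtime found, enabling the GPU backend\n");
      } else {
        std::printf("libOpenCL not found, falling back to the CPU backend\n");
      }
      return 0;
    }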

naibaf7 (Owner) commented Aug 3, 2016

@hughperkins Intriguing!
Does that also work with CUDA?
Because it would be really useful if there could be a universal binary that has no fixed dependencies beyond C++11 :)
What about multi-location search for libOpenCL.so and nvrtc.so/cuda.so? Can it figure out these different locations?

hughperkins commented Aug 4, 2016

Does that also work with CUDA?

No

Because it would be really useful if there could be a universal binary that has no fixed dependencies beyond C++11 :)

:-) Well, I guess someone could create something analogous for CUDA. Maybe... you? :-)

What about multi-location search for libOpenCL.so and nvrtc.so/cuda.so? Can it figure out these different locations?

For libOpenCL.so, it searches in a couple of hard-coded locations:

https://github.com/hughperkins/clew/blob/master/src/clew.c#L180-L184

    module = CLEW_DYNLIB_OPEN("OpenCL.dll");
    if(module == 0) module = CLEW_DYNLIB_OPEN("/Library/Frameworks/OpenCL.framework/OpenCL");
    if(module == 0) module = CLEW_DYNLIB_OPEN("libOpenCL.so");
    if(module == 0) module = CLEW_DYNLIB_OPEN("libOpenCL.so.1");
    if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/libOpenCL.so");
    if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/libOpenCL.so.1");

You can see that there is no reason why this couldn't be extended arbitrarily and/or read from some config file. But this covers almost all, or perhaps all, of the cases I've seen, I think.

naibaf7 (Owner) commented Aug 4, 2016

@hughperkins
Yeah, I see...
Yes, maybe you could add possible locations for SDKs such as AMD's APP SDK libOpenCL.so and NVIDIA's libOpenCL.so within CUDA :)
This also works fine with the OpenCL ICD loader, I assume?
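
For example, the hard-coded list quoted above could be extended with typical SDK install locations; the exact paths below are assumptions and would need checking against real installs:

    if(module == 0) module = CLEW_DYNLIB_OPEN("/opt/AMDAPP/lib/x86_64/libOpenCL.so");  // AMD APP SDK (assumed path)
    if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/local/cuda/lib64/libOpenCL.so");   // CUDA toolkit (assumed path)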

bhack commented Aug 4, 2016

hughperkins commented:

Wow, you are just a mine of useful information, bhack. It's amazing. Are you an AI? :-P

naibaf7 (Owner) commented Aug 5, 2016

@bhack @hughperkins
More importantly, why does bhack know that much and take all the time to respond? :) amazing...
but yeah oh well, cuew and clew are two more items to integrate into libDNN then... endless list of work :)

hughperkins commented:

This also works fine with the OpenCL ICD loader I assume?

yes. the sequence is:

clew => loads libOpenCL.so => reads the ICD => loads vendor drivers => loads device information, etc.

bhack commented Aug 5, 2016

@edgarriba Can you post the UML rendered image of the integration proposal?

bhack commented Aug 5, 2016

I don't think that Caffe is the right testbed for libdnn, because libdnn was in some sense designed around Caffe. So if you want to give some feedback on Edgar's design...

edgarriba (Contributor Author) commented:

Feedback is welcome!

bhack commented Aug 5, 2016

Another interesting roadmap to monitor is https://phabricator.pmoreau.org/w/mesa/opencl_through_spirv_current_status/

bhack commented Aug 5, 2016

@CNugteren I think that you could be interested in the latest messages.

bhack commented Aug 6, 2016

As you can see, we have started to use CLCudaAPI for tiny-dnn while also integrating libdnn for the convolutional kernels. @CNugteren said that he was interested in contributing to libdnn. Through this GSoC integration experiment I see a little bit of duplication of features: libdnn maintains its own tuning class while Cedric has CLTune; libdnn uses ViennaCL as a helper for OpenCL/CUDA while Cedric has CLCudaAPI; libdnn uses some algebra from ViennaCL and Cedric from CLBlast. Is there a bit of duplicated effort?

naibaf7 (Owner) commented Aug 6, 2016

@bhack
Yes, there certainly is a little bit of duplication, but I don't think that's an issue.
Inside Caffe I even use the duplication as an advantage and offer clBLAS, CLBlast and ViennaCL for the linear algebra part.
It's just that, especially with the algebra, different libraries perform well or poorly on different devices.

bhack commented Aug 6, 2016

Can't we share the tuning component? Also, I think CLCudaAPI is neutral enough to use ViennaCL and CLBlast algebra.

naibaf7 (Owner) commented Aug 6, 2016

@bhack
No, the way the code generation works in libDNN is too different from the CLTune approach, I think.
Yes, when you use CLCudaAPI you can of course also make use of multiple BLAS libraries.

bhack commented Oct 5, 2016

Not so bad for a small-scale test...

naibaf7 (Owner) commented Oct 5, 2016

@bhack
Still funny that there is a small but consistent accuracy difference between the implementations, even after repeating the training many times :)... yeah, it's a good small-scale test because repeating it only takes a few seconds, up to a minute on slower devices.

bhack commented Oct 6, 2016

Yes, remember also the old BVLC/caffe#3168

bhack commented Dec 9, 2016

@naibaf7 What about removing device and context creation from standalone libdnn and requiring that this kind of responsibility be handled by the third-party application and then passed to libdnn? Could this help keep the code in sync between the Caffe libdnn and the standalone version?

naibaf7 (Owner) commented Dec 9, 2016

@bhack
There already is no context and device creation in LibDNN - in fact you need to wrap an existing device into ViennaCL for LibDNN to be able to probe device capabilities and manage the device internally.
This is the central duplication of LibDNN with Caffe - inside the device class:
https://github.com/naibaf7/libdnn/blob/master/src/device.cpp
It furthermore simplifies alternation of device probing between CUDA and OpenCL.

The issue then is that I can't use a standalone LibDNN in Caffe, because it would make compilation of Caffe a multi-step process, whereas it is very easy at the moment (when using ViennaCL and LibDNN it's only one compile step and, aside from OS-provided libraries, just header-only dependencies on Boost and ViennaCL). And then there'd be problems with duplicated functionality, such as the device class. Now I can just use the existing management for both Caffe and LibDNN; otherwise I'd also have to start re-wrapping the OpenCL device from Caffe::device to LibDNN::device.

bhack commented Dec 9, 2016

Sorry, I probably haven't explained well what I mean... Is it possible to pass only some device metadata to libdnn and let the program be compiled and executed on the app side? Apps could communicate timings to the tuner and retrieve new source to compile and execute.

naibaf7 (Owner) commented Dec 9, 2016

@bhack
Not sure what you mean - only getting the kernel code and launching the kernel in your own OpenCL context without using the LibDNN launch code, then?
The problem here is that the tuning parameters and kernel properties need to be known to launch the kernels, as do the data and parameter integrity checks that happen on the host side before the OpenCL/CUDA kernels are launched. So no, code generation/tuning and kernel launching are pretty much inseparable - hence the use of wrappers similar to cuDNN and BLAS frameworks.

bhack commented Dec 9, 2016

What I mean, exactly, is to exchange only metadata. So we could exchange only metadata back and forth between libdnn and third-party apps. Then we wouldn't need to replicate bootstrap, compile and execution code, because libdnn only has partial coverage of the layers, so generally every third-party app will have its own setup, compile and launch code anyway.

naibaf7 (Owner) commented Dec 9, 2016

Yeah, but look at this example of how the launch code depends on kernel properties - I might be wrong here, but I think wrapping a device into ViennaCL or any other wrapper is way easier, and more in line with how cuDNN works, for example, than keeping up with launch-parameter conformity for all kernel variants - a small excerpt here:

  int_tp ims = batch_size * channels;
  for (int_tp i = 0; i < im_in_shape_.size(); ++i) {
    ims *= im_in_shape_[i];
  }
  LibDNN<Dtype>::SetMemory(bottom_diff, ims, 0, (Dtype) 0);

  int_tp imsi = std::accumulate(im_in_shape_.begin(), im_in_shape_.end(),
                                1, std::multiplies<int_tp>());
  int_tp imso = std::accumulate(im_out_shape_.begin(), im_out_shape_.end(),
                                1, std::multiplies<int_tp>());

  int_tp imsw = 0;
  if (bwalgo_ == LIBDNN_POOLING_BW_ALGO_DIRECT) {
    // Direct kernel iterates over input size
    imsw = imsi;
  } else {
    // Atomic kernel iterates over output size
    imsw = imso;
  }

  int_tp lw0 = bw_tuner_->get_param<int_tp>("LW0");
  int_tp lw1 = bw_tuner_->get_param<int_tp>("LW1");

#ifdef USE_GREENTEA
  if (LibDNN<Dtype>::dev_ptr_->backend() == BACKEND_OpenCL) {
    viennacl::ocl::kernel &kernel =
        LibDNN<Dtype>::ocl_program_.get_kernel("pool_backward");
    viennacl::ocl::context &ctx =
        viennacl::ocl::get_context(LibDNN<Dtype>::dev_ptr_->id());

    kernel.local_work_size(0, lw0);
    kernel.local_work_size(1, lw1);
    kernel.local_work_size(2, 1);

    kernel.global_work_size(0, ((imsw - 1) / lw0 + 1) * lw0);
    kernel.global_work_size(1, ((channels * batch_size - 1) / lw1 + 1) * lw1);
    kernel.global_work_size(2, 1);

    switch (pool_method_) {
      case LIBDNN_POOLING_METHOD_MAX:
        if (use_top_mask_) {
          viennacl::ocl::enqueue(
                 kernel(WrapHandle((cl_mem) top_diff, &ctx),
                        WrapHandle((cl_mem) bottom_diff, &ctx),
                        WrapHandle((cl_mem) top_mask, &ctx),
                        channels,
                        batch_size),
                 ctx.get_queue());
        } else {
         viennacl::ocl::enqueue(
                kernel(WrapHandle((cl_mem) top_diff, &ctx),
                       WrapHandle((cl_mem) bottom_diff, &ctx),
                       WrapHandle((cl_mem) mask, &ctx),
                       channels,
                       batch_size),
                ctx.get_queue());
        }
        break;
      case LIBDNN_POOLING_METHOD_AVE:
        viennacl::ocl::enqueue(
               kernel(WrapHandle((cl_mem) top_diff, &ctx),
                      WrapHandle((cl_mem) bottom_diff, &ctx),
                      channels,
                      batch_size),
               ctx.get_queue());
        break;
      case LIBDNN_POOLING_METHOD_STO:
        viennacl::ocl::enqueue(
               kernel(WrapHandle((cl_mem) top_diff, &ctx),
                      WrapHandle((cl_mem) bottom_diff, &ctx),
                      WrapHandle((cl_mem) rand_idx, &ctx),
                      channels,
                      batch_size),
               ctx.get_queue());
        break;
    }
  }

You'd have to duplicate about this amount of code times 8 (CUDA, OpenCL, forward, backward, convolution, pooling), and it will increase even more with additional kernel variants. Whereas with a wrapper you need not worry about kernel variants at all - it will just work, with few if any interface changes between versions of LibDNN.

bhack commented Dec 9, 2016

I don't know why we would need to replicate all this code 8 times... E.g. in our tiny-dnn case, where we use CLCudaAPI, we have these two interfaces for kernels:

template <typename T> void SetArgument(const size_t index, const T &value): Method to set a kernel argument (l-value or r-value). The argument index specifies the position in the list of kernel arguments. The argument value can also be a CLCudaAPI::Buffer.

template <typename... Args> void SetArguments(Args&... args): As above, but now sets all arguments in one go, starting at index 0. This overwrites any previous arguments (if any). The parameter pack args takes any number of arguments of different types, including CLCudaAPI::Buffer.
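
For context, using those two calls to launch a kernel might look roughly like the sketch below (the buffer types, Event usage and Launch signature follow CLCudaAPI's documented style but are assumptions here, not tiny-dnn or libdnn code):

    // Sketch: set arguments and launch a CLCudaAPI kernel.
    #include <clpp11.h>   // CLCudaAPI header (OpenCL back-end)
    #include <vector>

    void run(CLCudaAPI::Kernel &kernel, CLCudaAPI::Queue &queue,
             CLCudaAPI::Buffer<float> &in, CLCudaAPI::Buffer<float> &out, int n) {
      kernel.SetArguments(in, out, n);                        // all arguments at once
      auto global = std::vector<size_t>{static_cast<size_t>(n)};
      auto local  = std::vector<size_t>{64};
      auto event  = CLCudaAPI::Event();
      kernel.Launch(queue, global, local, event.pointer());   // assumed launch call
      queue.Finish(event);
    }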

Could this be asked of libdnn?

naibaf7 (Owner) commented Dec 9, 2016

@bhack
Yeah, you could launch the kernels with that; however, you would still need a variant for each of the different kernels. And if I add a new kernel variant or change the interface to a kernel, you'd have to adapt all of that again. Building a metadata exchange system that explains in full detail how the kernel arguments need to be set & what needs to be prepared before the kernel can be launched seems far more complex than wrapping the OpenCL device. I quite frankly don't see the point of making this more complex than it is or needs to be. The current binding code from tiny-dnn to libdnn is quite compact compared to what it would be otherwise... the interface has a clear definition of what tensors are expected, what function should be executed and what the effects on the tensors are. Extending the configuration structure with new options will not break compatibility with frameworks using libdnn.
What do others think about this? @CNugteren @edgarriba @hughperkins ?

bhack commented Dec 9, 2016

It is not a particular problem for tiny-dnn, because libdnn is actually already a shared object for us. It was just an attempt to find a path that lets Caffe and tiny-dnn use libdnn directly as upstream, and to collect more variants here (e.g. the Intel kernels).

naibaf7 (Owner) commented Dec 9, 2016

@bhack
OK, I see. I think the more sensible path for me to take is to move a device abstraction framework from Caffe into LibDNN and base Caffe on LibDNN that way, but I don't have the time to try this approach out yet.
The metadata exchange, however, would definitely introduce too much maintenance and work overhead for me.
Adding kernel variants such as the Intel kernels is not a big deal; it can be accommodated in LibDNN, but I think I need to coordinate this with @gongzg first. Since they don't have separate efficient backward kernels in their implementation yet, it would be good to mix & match it with the LibDNN backward kernels.

bhack commented Dec 9, 2016

Are you sure that this is a plausible path? You still need all this stuff for launching kernels not covered by libdnn (fully connected, etc.), same as tiny-dnn.

naibaf7 (Owner) commented Dec 9, 2016

@bhack
Yes, I think it is. Like MKL, cuDNN and clBLAS, LibDNN will stay an acceleration library providing fast implementations with a uniform interface where appropriate. It is not a complete DNN framework, so the rest remains Caffe's, Torch's, TensorFlow's and tiny-dnn's responsibility.
The focus remains on things like performance portability, different compile backends (CUDA, OpenCL, eventually OpenMP), performance-critical DNN kernels, and easy integration into deep learning frameworks.

bhack commented Dec 9, 2016

Yes, that's what I mean... how could you move the device abstraction from Caffe to libdnn? You would still need it in Caffe for the other kernels, just like tiny-dnn does.

bhack commented Dec 30, 2016

@naibaf7 How do you think that XLA will impact this project?

bhack commented Dec 30, 2016

naibaf7 (Owner) commented Dec 31, 2016

@bhack
Not sure yet, maybe I'll also do some work on it. I'll have to see it in action first and take a look at what operations it can do so far, etc.
I've spoken with AMD and will continue to get their support for what I'm doing currently.
I definitely made similar observations about performance improvements when I implemented a simpler variant of XLA (only a high-level, target-independent optimizer) for the pooling operations in LibDNN.

bhack commented Jan 11, 2017

@naibaf7 if you are interested take a look at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler

naibaf7 (Owner) commented Feb 11, 2017

@bhack Bumped to the latest kernel versions and added pooling kernels. Should be straightforward to use again; don't hesitate to ask questions.
Hope this is useful for tiny-dnn.

kaishijeng commented:

I have compiled and installed libdnn from GitHub and got the following errors with tiny-dnn:

    tiny-dnn/tiny_dnn/core/kernels/conv2d_op_libdnn.h:227:5: error: 'LibDNNConfig' is not a member of 'greentea'
        greentea::LibDNNConfig config;

    /tiny-dnn/tiny_dnn/core/kernels/conv2d_op_libdnn.h:229:5: error: 'config' was not declared in this scope
        config.dev_ptr = dev_ptr_.get();

Any idea?

Thanks

naibaf7 (Owner) commented Apr 30, 2017

Oh... no, it's an incompatibility issue. You need to ask them to change "LibDNNConfig" to "LibDNNConvConfig" in the source code, because LibDNN now also supports pooling and deconvolution, but they haven't updated their code to use that new version of LibDNN yet.
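
Concretely, based on the lines in the error report above, the change on the tiny-dnn side would look something like this:

    // tiny_dnn/core/kernels/conv2d_op_libdnn.h -- rename to match newer LibDNN:
    greentea::LibDNNConvConfig config;   // was: greentea::LibDNNConfig
    config.dev_ptr = dev_ptr_.get();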

edgarriba (Contributor Author) commented:

@naibaf7 thanks for pointing that out, we'll try to update it during the summer.

naibaf7 (Owner) commented Apr 30, 2017

@edgarriba
Yup, and we should probably insert tags for compatibility. Do you use LibDNN as a submodule with tiny-dnn? If so, make sure that it always checks out the "latest known working" commit in accordance with your testing... the LibDNN interface will likely not stay stable over the next few months, as many updates will be coming.

edgarriba (Contributor Author) commented:

Right, we are not checking versions right now. It would be worth doing; I'm opening an issue to fix that.

bhack commented Aug 2, 2017

@naibaf7 We are planning to integrate libdnn soon, with @Randl. Any feedback on the latest developments? Is the library tagged?

naibaf7 (Owner) commented Aug 3, 2017

@bhack
I am currently working on a very time-intensive project on Caffe/LibDNN, but the standalone LibDNN hasn't received any new updates since we last communicated.
