OpenCL GPU Routines

Rok Češnovar edited this page Jan 28, 2020 · 25 revisions

OpenCL is an open standard for writing programs that run across heterogeneous hardware such as CPUs and GPUs. Stan uses OpenCL to implement GPU routines for the Cholesky decomposition and its derivative. Other routines will be available in the future. These routines are suitable for programs that operate on large NxM matrices (N>600), such as algorithms that use large covariance matrices.

Requirements

Users must have suitable hardware (e.g. an Nvidia or AMD GPU) that supports OpenCL 1.2, a valid OpenCL driver, and a suitable C/C++ compiler installed on their computer.

Installation

Linux

The following guide is for Ubuntu, but it should be similar for any other Linux distribution. You should have the GNU compiler suite or clang compiler installed beforehand.

Install the Nvidia CUDA toolkit and the clinfo tool if you have an Nvidia GPU:

apt update
apt install nvidia-cuda-toolkit clinfo

Those with AMD devices can install the OpenCL driver with:

apt install -y libclc-amdgcn mesa-opencl-icd clinfo

If your device is not supported by the currently available drivers, you can try the Paulo Miguel Dias PPA:

add-apt-repository ppa:paulo-miguel-dias/mesa 
apt-get update
apt-get install libclc-amdgcn mesa-opencl-icd

macOS

Macs should already have the OpenCL driver installed if they have the appropriate hardware.

Note that if you are building on a Mac laptop, you may not have a GPU device. You can still use the OpenCL routines for parallelization on your CPU.

Windows

Install the latest Rtools suite if you don't already have it. During the installation, make sure that the 64-bit toolchain is installed. You also need to verify that the system environment variable Path includes the path to the g++ compiler (<Rtools installation path>\mingw_64\bin).

If you have an Nvidia card, install the latest Nvidia CUDA toolkit. AMD users should use the AMD APP SDK.

Users can check that their installation is valid by downloading and running clinfo.
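
Once clinfo is installed, its list mode (clinfo -l) prints one line per platform and device. The following is a minimal sketch of a POSIX shell filter that turns that listing into platform/device index pairs. The function name list_opencl_devices is hypothetical, and the parsing assumes the "Platform #N:" and "Device #N:" line format that clinfo -l prints.

```shell
# Hypothetical helper (not part of Stan or clinfo): read `clinfo -l` output
# on stdin and print one "platform=P device=D name=..." line per device.
list_opencl_devices() {
  awk '
    /Platform #[0-9]+:/ {
      # Remember the index of the platform the following devices belong to.
      platform = $0
      sub(/^.*Platform #/, "", platform)
      sub(/:.*$/, "", platform)
      next
    }
    /Device #[0-9]+:/ {
      device = $0
      sub(/^.*Device #/, "", device)   # e.g. "0: TITAN Xp"
      name = device
      sub(/:.*$/, "", device)          # keep only the index
      sub(/^[0-9]+: */, "", name)      # keep only the device name
      printf "platform=%s device=%s name=%s\n", platform, device, name
    }
  '
}

# Usage (requires clinfo on the PATH):
#   clinfo -l | list_opencl_devices
# prints lines such as: platform=2 device=0 name=TITAN Xp
```

Platforms that expose no devices produce no output lines, so every printed pair is a candidate for the OpenCL settings.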

Setting up the Math Library to run on a GPU

To turn on GPU computation:

  1. Check with clinfo which device and platform you would like to use, and record their indices; you will need the platform and device index, as in the printout below:
clinfo -l
# Platform #0: Clover
# Platform #1: Portable Computing Language
#  `-- Device #0: pthread-AMD Ryzen Threadripper 2950X 16-Core Processor
# Platform #2: NVIDIA CUDA
#  +-- Device #0: TITAN Xp
#  `-- Device #1: GeForce GTX 1080 Ti
  2. In the top level of the math library, open the text file make/local; if it does not exist, create it. If you are using the OpenCL functionality via CmdStan, you can instead use the make/local file in the CmdStan folder (cmdstan/make/local).
  3. Add these lines to the make/local file:
STAN_OPENCL=true
OPENCL_DEVICE_ID=${CHOSEN_INDEX}
OPENCL_PLATFORM_ID=${CHOSEN_INDEX}

where ${CHOSEN_INDEX} is replaced with the index of the device and of the platform you would like to use, respectively. In most cases both will be 0. If you are using Windows, append the following lines to the end of the make/local file in order to link with the appropriate OpenCL library:

  • Nvidia
CC = g++
LDFLAGS_OPENCL= -L"$(CUDA_PATH)\lib\x64" -lOpenCL
  • AMD
CC = g++
LDFLAGS_OPENCL= -L"$(AMDAPPSDKROOT)lib\x86_64" -lOpenCL
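
On Linux or macOS, the make/local step can be scripted. The sketch below assumes the sample clinfo printout above, where the TITAN Xp is device 0 on platform 2. The helper name enable_stan_opencl and the path in the usage comment are hypothetical. Only the three settings named above are written.

```shell
# Hypothetical helper (not part of Stan Math): append the OpenCL settings
# to <dir>/make/local, creating the file if it does not exist yet.
enable_stan_opencl() {
  dir=$1          # top level of the Math library or of CmdStan
  platform_id=$2  # platform index reported by clinfo -l
  device_id=$3    # device index within that platform
  mkdir -p "$dir/make"
  {
    echo "STAN_OPENCL=true"
    echo "OPENCL_DEVICE_ID=$device_id"
    echo "OPENCL_PLATFORM_ID=$platform_id"
  } >> "$dir/make/local"
}

# Usage (assumed path; platform 2, device 0 selects the TITAN Xp
# in the sample printout):
#   enable_stan_opencl path/to/math 2 0
```

The helper appends rather than overwrites, so existing entries in make/local are preserved. If the file already contains OpenCL settings, remove the old lines by hand to avoid confusion.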

Running Tests with OpenCL

Once you have completed the steps above, runTests.py should execute with the GPU enabled. All OpenCL tests match the pattern *_opencl_*, so they can be filtered, for example:

./runTests.py test/unit -f opencl

Using the OpenCL backend

We currently have support for the following methods:

  • bernoulli_logit_glm
  • cholesky_decompose
  • categorical_logit_glm
  • gp_exp_quad_cov
  • mdivide_right_tri
  • mdivide_left_tri
  • multiplication
  • neg_binomial_2_log_glm
  • normal_id_glm
  • ordered_logistic_glm
  • poisson_log_glm

TODO(Rok): provide example models for GLMs and GP

Troubleshooting

If you see the following error:

clBuildProgram CL_OUT_OF_HOST_MEMORY: Unknown error -6

you have most likely run out of available memory on your host system. OpenCL kernels are compiled just-in-time at the start of any OpenCL-enabled Stan/Stan Math program, which may therefore require more memory than a program running without OpenCL support. If several CmdStan processes are started at the same time, each process needs that memory for a moment. If there is not enough memory to compile the OpenCL kernels, you will see this error. Try running your model with fewer processes. Upgrading your GPU driver may also reduce the RAM usage for OpenCL kernel compilation.
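
One way to run fewer processes at once is to start the chains one after another instead of in parallel. This is a minimal sketch under some assumptions: the function name run_chains_sequentially is hypothetical, and the argument layout (sample, id=, output file=) follows the usual CmdStan command line and should be checked against your installed version.

```shell
# Hypothetical helper: run N chains of a compiled CmdStan model one at a
# time, so only one process compiles OpenCL kernels at any moment.
run_chains_sequentially() {
  model=$1      # path to the compiled model executable
  n_chains=$2   # number of chains to run
  shift 2       # remaining arguments are passed through, e.g. data file=...
  chain=1
  while [ "$chain" -le "$n_chains" ]; do
    # Stop at the first failing chain instead of continuing silently.
    "$model" sample id="$chain" "$@" output file="output_${chain}.csv" || return 1
    chain=$((chain + 1))
  done
}

# Usage (assumed model and data file names):
#   run_chains_sequentially ./my_model 4 data file=my_data.json
```

Sequential execution trades wall-clock time for a lower peak in host memory. Running two chains at a time is a possible middle ground if memory allows.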
