CUDA
Currently we aren't using any custom-written CUDA code, but this tutorial was ported over from the old wiki.
CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce. With CUDA, recent NVIDIA GPUs can be used for general-purpose computation, much like CPUs. Unlike CPUs, however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very quickly.
- Make sure the Linux distro is supported (check CUDA toolkit release notes).
- Make sure gcc is installed.
- Download and install the NVIDIA CUDA Toolkit. Refer to the Linux Getting Started Guide for installation instructions.
- **Note:** Some required steps are not mentioned in the guide above (observed on Ubuntu 10.04 and 12.04). You may need to make symbolic links to libglut.so and libcuda.so because of a difference in file names. This is usually solved by linking your libglut.so file into the /usr/lib/ directory (run apt-get install freeglut3 if you don't have it). Sometimes the file name has to be changed from libglut.so.3 to libglut.so. Use this link for help. Also, use the NVIDIA driver that comes with the toolkit instead of the proprietary driver that Ubuntu offers to install.
CUDA code is saved in a .cu file and compiled with nvcc, the compiler from the installed CUDA toolkit. The kernels in that file run on the GPU, while the rest of the program runs on the CPU. The terminology used is "host" (for the CPU) and "device" (for the GPU). Refer to the general procedure below.
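As a minimal sketch of what a .cu file can look like, the function below only asks the CUDA runtime how many devices it can see, which also serves as a quick check that the toolkit and driver are working. The file name `device_count.cu` and the function name `countCudaDevices` are made up for illustration; `cudaGetDeviceCount` and `cudaGetErrorString` are CUDA runtime calls.

```cuda
// device_count.cu (hypothetical file name)
#include <cstdio>
#include <cuda_runtime.h>

// Returns the number of CUDA-capable devices the runtime can see,
// or -1 if the runtime reports an error (for example, no driver or no GPU).
int countCudaDevices() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    return count;
}
```

Once a `main` function that calls `countCudaDevices()` is added, the file should build with something like `nvcc device_count.cu -o device_count`, assuming nvcc is on the PATH.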
Functions that the host launches on the device are called "kernels"; this page uses "kernel" and "function" interchangeably. Kernels are basically all the actual CUDA code. Note: some restrictions apply to what kernels can do. A kernel launched by the host cannot return a value (it must be declared void), cannot take a variable number of arguments, and cannot use host (system) memory directly. Older GPUs also did not support recursion. Function qualifiers are special keywords that indicate what type of function is being defined. They can be one of three types:
- `__global__` (these kernels are called by the host and run on the device; this is the basic kind of kernel that the host asks the device to run)
- `__device__` (these functions are called by the device and run on the device; think of them as helper functions that are called within `__global__` kernels)
- `__host__` (a normal host function that doesn't run on the device; basically a normal C function that the host runs. This is also what you get if no qualifier is included)

Furthermore, whenever the host calls a kernel, it has to specify a launch configuration in triple angle brackets: `<<<numberOfBlocks, numberOfThreadsPerBlock>>>`.
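The three qualifiers and the launch configuration described above can be sketched together as follows. The function names (`square`, `squareAll`, `squareOnDevice`) and the block size of 256 are assumptions chosen for illustration. The built-in variables `blockIdx`, `blockDim`, and `threadIdx` give each thread its position in the launch, and error checking is left out to keep the sketch short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// __device__: callable only from code that is already running on the GPU.
__device__ float square(float x) {
    return x * x;
}

// __global__: launched by the host, executed on the device. Must return void.
__global__ void squareAll(float *data, int n) {
    // Each thread works out which element it is responsible for.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may contain threads past the end of the array
        data[i] = square(data[i]);
    }
}

// __host__: ordinary CPU function (the qualifier could be omitted).
__host__ std::vector<float> squareOnDevice(std::vector<float> values) {
    int n = static_cast<int>(values.size());
    if (n == 0) return values;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *devData = nullptr;
    cudaMalloc((void **)&devData, bytes);
    cudaMemcpy(devData, values.data(), bytes, cudaMemcpyHostToDevice);

    // Launch configuration: enough blocks of 256 threads to cover n elements.
    int numberOfThreadsPerBlock = 256;
    int numberOfBlocks = (n + numberOfThreadsPerBlock - 1) / numberOfThreadsPerBlock;
    squareAll<<<numberOfBlocks, numberOfThreadsPerBlock>>>(devData, n);

    cudaMemcpy(values.data(), devData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devData);
    return values;
}
```

Rounding the block count up and guarding with `if (i < n)` is a common way to handle array sizes that are not a multiple of the block size.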
The device cannot access the host memory (RAM) directly, and in the same way, the host cannot access the device memory (GPU RAM). The workaround is to copy the data back and forth between host memory and device memory. The function used to copy the data is:
- `cudaMemcpy(void *dest, const void *src, size_t sizeInBytes, enum cudaMemcpyKind direction);`

dest is the destination pointer, src is the source pointer, sizeInBytes is the number of bytes to copy, and direction is (usually) either cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost.

To allocate memory on the device, you use the cudaMalloc function (analogous to malloc in plain ol' C). The syntax is:
- `cudaMalloc(void **devPtr, size_t sizeInBytes);`

This function reserves sizeInBytes bytes of device memory and stores the address of that memory in the pointer that devPtr points to. To deallocate memory on the device, you use the cudaFree function (again, analogous to free in plain C). The syntax is:
- `cudaFree(void *devPtr);`

This function releases memory previously allocated by cudaMalloc().
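To show the three memory calls working together without any kernel involved, here is a small sketch that copies an array to the device and straight back again. The function name `roundTripThroughDevice` is hypothetical, and returning an empty vector on failure is just one possible way to report errors.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Copies `input` into device memory and back into a new host vector.
// Returns an empty vector if any CUDA call fails.
std::vector<int> roundTripThroughDevice(const std::vector<int> &input) {
    std::vector<int> output(input.size(), 0);
    if (input.empty()) return output;
    size_t sizeInBytes = input.size() * sizeof(int);

    // Reserve device memory; devPtr now holds a device address.
    int *devPtr = nullptr;
    if (cudaMalloc((void **)&devPtr, sizeInBytes) != cudaSuccess) {
        return {};
    }

    // Host -> device, then device -> host.
    cudaError_t err = cudaMemcpy(devPtr, input.data(), sizeInBytes,
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        err = cudaMemcpy(output.data(), devPtr, sizeInBytes,
                         cudaMemcpyDeviceToHost);
    }

    // Release the device memory whether or not the copies succeeded.
    cudaFree(devPtr);
    if (err != cudaSuccess) return {};
    return output;
}
```

Note that `devPtr` is a device address: the host can pass it to CUDA functions and kernels, but must not dereference it.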
This concept is the final piece in the puzzle. Once I understand this, coding CUDA will be a breeze.
The basic procedure is to first allocate the needed pointers (or data structures) in device memory with cudaMalloc. Then, we copy the required data (the input that will go into the kernel) from host memory to device memory. We then launch the kernel with the device pointers as arguments (along with the blocks and threads launch configuration). Afterwards, we copy the results back from device memory to host memory. Finally, we free the device pointers with cudaFree.
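The procedure above can be sketched as a vector addition, with each step marked in the comments. The names `addKernel` and `addOnDevice` and the block size of 256 are assumptions for illustration, and return codes are ignored for brevity; real code would check them.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void addKernel(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

std::vector<float> addOnDevice(const std::vector<float> &a,
                               const std::vector<float> &b) {
    assert(a.size() == b.size());
    int n = static_cast<int>(a.size());
    std::vector<float> out(n);
    if (n == 0) return out;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // 1. Allocate device memory.
    float *devA = nullptr, *devB = nullptr, *devOut = nullptr;
    cudaMalloc((void **)&devA, bytes);
    cudaMalloc((void **)&devB, bytes);
    cudaMalloc((void **)&devOut, bytes);

    // 2. Copy the inputs from host memory to device memory.
    cudaMemcpy(devA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b.data(), bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel with the device pointers as arguments.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addKernel<<<blocks, threadsPerBlock>>>(devA, devB, devOut, n);

    // 4. Copy the result back. On the default stream this cudaMemcpy
    //    waits for the kernel to finish before copying.
    cudaMemcpy(out.data(), devOut, bytes, cudaMemcpyDeviceToHost);

    // 5. Free the device memory.
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devOut);
    return out;
}
```

For example, calling `addOnDevice({1, 2, 3}, {10, 20, 30})` on a machine with a working CUDA device should give `{11, 22, 33}`.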
Fill this in later on...