README

About this repository

Purpose:

This repository offers a simple tutorial to concretely understand the basic principles of CUDA programming. Example scripts are provided for both Python and MATLAB languages.

Toy example:

Let us consider a 3D volume of size (depth, dim_x, dim_y). The objective of this dummy operation is to extract a smaller 3D volume of size (thickness, dim_x, dim_y), s.t. thickness < depth. This toy example is solved first via a GPU-accelerated parallel implementation, then via a CPU-based nested for loops implementation.

Prerequisite

See: https://docs.nvidia.com/cuda/index.html
CUDA-capable hardware (e.g., NVIDIA GPU)
Python and/or MATLAB
CUDA-capable software environment (please check your own compilers, libraries, and toolkits)
Note for struggling Debian/Fedora/Arch/... Linux users: as of today, Ubuntu is considered as being one of the most CUDA-compatible distro

How to compile CUDA files

See: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation
All CUDA files are located in the folder CUDA-tutorial > cuda_sourcecodes
A CUDA file (.cu) must be compiled via NVCC (Nvidia CUDA Compiler) into a Parallel Thread Execution file ( .ptx)
Identify the codename sm_xy that matches your GPU architecture (e.g., Turing is sm_75):
Run this command to compile the file name_of_the_cuda_file.cu into name_of_the_cuda_file.ptx (replace sm_xy by the value adapted to your GPU architecture):

./compile_pycuda_sourcecode.sh sm_xy name_of_the_cuda_file.cu

Run this command to compile all CUDA files contained in CUDA-tutorial > cuda_sourcecodes (replace sm_xy by your the value adapted to your GPU architecture):

./batch_compile_pycuda_sourcecode.sh sm_xy

How to run the example scripts

For Python, run:

CUDA-tutorial > main_python.py

For MATLAB, run:

CUDA-tutorial > main_matlab.m

About CUDA acceleration

Core idea:

Operations are carried out by the GPU, as opposed to the CPU
Operations are performed in parallel by Threads, as opposed to sequentially by nested for loops
Threads are organized within an upper-hierarchy of Blocks, themselves organized within an upper-hierarchy of Grid

Coordinates design:

The Grid is unique and therefore has no coordinates
Blocks have 3D coordinates within the Grid (blockIDx.x, blockIDx.y, blockIDx.z)
Threads have 3D coordinates within a Block (threadIDx.x, threadIDx.y, threadIDx.z)

Size design:

The Grid is defined by its 3D shape (gridDim.x, gridDim.y, gridDim.z)
Blocks are defined by their 3D shape (blockDim.x, blockDim.y, blockDim.z)
Threads are zero-dimensional and therefore have no shape

More advanced implementation techniques (not needed in this toy example and not covered in this tutorial):

Use atomicAdd(int* address, int val) to prevent race conditions between different threads within the same thread block
Use __syncthreads() to enforce a block-level synchronization barrier of parallel threads

Performance optimization (disregarded in this tutorial):

The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware
When possible, the total number of threads should be augmented to keep the total number of blocks low

Useful GPU-related Linux-based command lines

$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 SUPER] (rev a1)

$ watch -1 nvidia-smi
Every 1,0s: nvidia-smi
Mon Dec 27 12:29:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   45C    P5    12W / 175W |    148MiB /  7982MiB |     39%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1286      G   /usr/lib/xorg/Xorg                147MiB |
+-----------------------------------------------------------------------------+

$ sudo lshw -C display
*-display                 
   description: VGA compatible controller
   product: TU106 [GeForce RTX 2060 SUPER]
   vendor: NVIDIA Corporation
   physical id: 0
   bus info: pci@0000:01:00.0
   version: a1
   width: 64 bits
   clock: 33MHz
   capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
   configuration: driver=nvidia latency=0
   resources: irq:138 memory:a4000000-a4ffffff memory:90000000-9fffffff memory:a0000000-a1ffffff ioport:3000(size=128) memory:a5000000-a507ffff

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
cuda_sourcecodes		cuda_sourcecodes
matlab_routines		matlab_routines
python_routines		python_routines
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main_matlab.m		main_matlab.m
main_python.py		main_python.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

README

About this repository

Purpose:

Toy example:

Prerequisite

How to compile CUDA files

How to run the example scripts

About CUDA acceleration

Core idea:

Coordinates design:

Size design:

More advanced implementation techniques (not needed in this toy example and not covered in this tutorial):

Performance optimization (disregarded in this tutorial):

Useful GPU-related Linux-based command lines

About

Releases

Packages

Contributors 2

Languages

License

GuillaumeZahnd/CUDA-tutorial

Folders and files

Latest commit

History

Repository files navigation

README

About this repository

Purpose:

Toy example:

Prerequisite

How to compile CUDA files

How to run the example scripts

About CUDA acceleration

Core idea:

Coordinates design:

Size design:

More advanced implementation techniques (not needed in this toy example and not covered in this tutorial):

Performance optimization (disregarded in this tutorial):

Useful GPU-related Linux-based command lines

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages