
CUDA basics


CUDA support and development in PyTorch

PyTorch provides CUDA implementations for all its native functions.

How to implement CUDA support for a new operation?

  • Do I need to write a CUDA kernel?

    Likely not; see below.

  • If the operation can be expressed through existing PyTorch ops, you don't need to write a kernel; it will automatically be supported on CUDA.

  • If it's a pointwise operation or a reduction, you only need to define a functor to apply to each tensor element and reuse the existing TensorIterator kernels (see the first sketch after this list).

  • Why not write my own kernel? It seems so easy!

    Getting a pointwise operation right is surprisingly non-trivial. PyTorch supports features such as non-contiguous tensors, implicit broadcasting, and type promotion, and it is expected to handle tensors with more than INT_MAX elements; it's easy to run afoul of that by writing a naive kernel with 32-bit integer indexing. Your kernel would need to handle all of this, at which point it won't be so easy. Besides, the existing kernels apply performance optimizations such as unrolling and vectorization.

  • Even if it's an irregular operation such as indexing/scatter/gather, try to express it as an iteration over tensor elements (in this case, over the elements of the index tensor) and use TensorIterator; see the existing indexing/scatter/gather implementations for examples. It will likely be faster and will spare you painful debugging.

  • If it's a compute-intensive operation such as matrix multiply or convolution, try using existing libraries - cuBLAS and cuDNN. Writing compute-intensive code well requires a lot of expertise.

  • For common primitives such as sort, inclusive/exclusive scan, and unique elements, use the CUB library (see the second sketch after this list). Don't use Thrust!
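
A minimal sketch of the TensorIterator path for a pointwise op, assuming a hypothetical my_scale_cuda function and the gpu_kernel helper from ATen/native/cuda/Loops.cuh; a real implementation would also dispatch over dtypes with the AT_DISPATCH_* macros rather than hard-coding float:

```cpp
#include <ATen/ATen.h>
#include <ATen/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>

// Hypothetical pointwise op: out = alpha * self, hard-coded to float for brevity.
at::Tensor my_scale_cuda(const at::Tensor& self, double alpha) {
  auto out = at::empty_like(self);
  auto iter = at::TensorIteratorConfig()
                  .add_output(out)
                  .add_input(self)
                  .build();
  const float a = static_cast<float>(alpha);
  // The functor runs once per element; indexing, non-contiguous layouts, broadcasting,
  // 64-bit element counts, unrolling and vectorization are handled by gpu_kernel.
  at::native::gpu_kernel(iter, [a] GPU_LAMBDA(float x) -> float { return a * x; });
  return out;
}
```

Reductions follow an analogous path; see ATen/native/cuda/Reduce.cuh and the existing reduction kernels for real examples.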
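
And a sketch of calling CUB directly for an inclusive scan, assuming CUB's usual two-phase calling convention and a hypothetical my_inclusive_sum_cuda wrapper; note how the temporary storage comes from the caching allocator (via at::empty) rather than cudaMalloc, and the calls run on the current stream:

```cpp
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <cub/cub.cuh>

// Hypothetical wrapper: inclusive prefix sum of a contiguous float CUDA tensor.
// CUB's int num_items limits this sketch to tensors with fewer than 2^31 elements.
at::Tensor my_inclusive_sum_cuda(const at::Tensor& self) {
  auto input = self.contiguous();
  auto output = at::empty_like(input);
  const int num_items = static_cast<int>(input.numel());
  auto stream = at::cuda::getCurrentCUDAStream();

  // First call only computes the required temporary storage size.
  size_t temp_storage_bytes = 0;
  cub::DeviceScan::InclusiveSum(
      nullptr, temp_storage_bytes,
      input.data_ptr<float>(), output.data_ptr<float>(), num_items, stream);

  // Allocate scratch space through the caching allocator, never with cudaMalloc.
  auto temp_storage = at::empty(
      {static_cast<int64_t>(temp_storage_bytes)}, input.options().dtype(at::kByte));

  // Second call performs the scan.
  cub::DeviceScan::InclusiveSum(
      temp_storage.data_ptr(), temp_storage_bytes,
      input.data_ptr<float>(), output.data_ptr<float>(), num_items, stream);
  return output;
}
```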

Common gotchas for writing CUDA code

  • If you are writing your own kernel, use the existing utilities to calculate the number of blocks, to perform atomic operations in the kernel, and to perform reductions within a block. CUB also provides block-wide primitives that can be useful. (See the sketch after this list.)
  • Avoid raw CUDA APIs; PyTorch typically provides wrappers for them. NEVER allocate memory with cudaMalloc/cudaFree; use only the caching allocator.
  • Avoid host-device synchronizations (these can happen if you copy data between CPU and GPU, or call .item() on a tensor).
  • In PyTorch core, codegen takes care of making sure that the current device is the same as the device the tensors are located on, and that all arguments are on the same device. If you are writing operations outside of core (e.g. in an extension), you will need to take care of this yourself.
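
A minimal sketch tying these points together for a hand-written kernel (hypothetical my_scale_kernel / my_scale_launcher names; for a real pointwise op you would still prefer TensorIterator as described above): the grid size comes from at::ceil_div, temporaries come from the caching allocator via at::empty_like, the launch uses the current stream, and an explicit device guard is set because this code is not going through core codegen:

```cpp
#include <algorithm>
#include <ATen/ATen.h>
#include <ATen/ceil_div.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAException.h>
#include <c10/cuda/CUDAGuard.h>

// Grid-stride loop with 64-bit indexing so tensors with more than INT_MAX elements work.
__global__ void my_scale_kernel(const float* in, float* out, float alpha, int64_t n) {
  for (int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x; i < n;
       i += static_cast<int64_t>(blockDim.x) * gridDim.x) {
    out[i] = alpha * in[i];
  }
}

at::Tensor my_scale_launcher(const at::Tensor& self, double alpha) {
  // Core codegen normally sets the current device; out-of-core code must do it itself.
  const c10::cuda::CUDAGuard device_guard(self.device());

  auto input = self.contiguous();
  // at::empty_like goes through the caching allocator; never cudaMalloc/cudaFree.
  auto output = at::empty_like(input);
  const int64_t n = input.numel();
  if (n == 0) {
    return output;
  }

  constexpr int threads = 256;
  // Conservative cap on the grid size; the grid-stride loop covers the rest.
  const int blocks = static_cast<int>(
      std::min<int64_t>(at::ceil_div(n, static_cast<int64_t>(threads)), 65535));
  auto stream = at::cuda::getCurrentCUDAStream();

  my_scale_kernel<<<blocks, threads, 0, stream>>>(
      input.data_ptr<float>(), output.data_ptr<float>(), static_cast<float>(alpha), n);
  // Report launch errors right away instead of at some later, unrelated call.
  C10_CUDA_KERNEL_LAUNCH_CHECK();
  return output;
}
```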

Debugging and profiling tips

  • CUDA execution is asynchronous, so the backtrace you get from a CUDA error is likely pointing to the wrong place. The error message will typically suggest running with CUDA_LAUNCH_BLOCKING=1; do that!
  • Use cuda-memcheck and cuda-gdb to get more detailed information
  • You can use torch.cuda.set_sync_debug_mode to warn or error out on CUDA synchronizations, e.g. to understand where synchronizations in your workload are coming from, or to check whether your operations are accidentally synchronizing.
  • Use the PyTorch built-in profiler (Kineto) or nsys to get information on GPU utilization and the most time-consuming kernels.

See https://www.dropbox.com/s/3350s3qfy8rpm5a/CUDA_presentation.key?dl=0 for a high-level overview of CUDA programming and useful links.

Go to N1023015 to do the TensorIterator CUDA perf lab.
