Reduce block size of sumFramesGpu CUDA kernel
This commit reduces the maximum kernel block size so that (nominally) at least 3 blocks fit on each multiprocessor (MP) instead of just 2. This was necessary because it turns out that register usage, rather than thread count or shared memory usage, is what limits how many blocks can be launched per MP. Unfortunately, we don't have a way to programmatically determine the register usage, so I am lowering the block size as a workaround. This commit also adds error-checking code so that kernel launch errors of this kind are reported more transparently in the future.
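
The register limit described above can be illustrated with a little host-side arithmetic. This is only a rough sketch, not the code from this commit: the function names are hypothetical, the per-thread register count is taken as an input (for instance as reported by the compiler with `nvcc --ptxas-options=-v`), and the model ignores the hardware's register allocation granularity, so real occupancy may be slightly lower.

```cpp
// Hypothetical helper: upper bound on resident blocks per multiprocessor (MP)
// imposed by the register file alone. Thread-count and shared-memory limits
// are ignored here, as is register allocation granularity.
int blocksPerMpByRegisters(int regsPerMp, int regsPerThread, int threadsPerBlock) {
    if (regsPerThread <= 0 || threadsPerBlock <= 0) return 0;
    return regsPerMp / (regsPerThread * threadsPerBlock);
}

// Hypothetical helper: largest block size, rounded down to a whole number of
// warps, for which at least `wantedBlocks` blocks fit in the register file.
int maxBlockSizeFor(int regsPerMp, int regsPerThread, int wantedBlocks,
                    int warpSize = 32) {
    if (regsPerThread <= 0 || wantedBlocks <= 0 || warpSize <= 0) return 0;
    int threads = regsPerMp / (regsPerThread * wantedBlocks);
    return (threads / warpSize) * warpSize;
}
```

With purely illustrative numbers (a 16384-register MP and a kernel using 21 registers per thread, neither taken from this commit), `blocksPerMpByRegisters(16384, 21, 384)` gives 2 while `blocksPerMpByRegisters(16384, 21, 256)` gives 3, and `maxBlockSizeFor(16384, 21, 3)` gives 256. That is the kind of trade-off this change makes: a smaller block buys one more resident block per MP.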