Add single-precision kernel for calcMahalDistGpu

We have substantially simplified the kernel, which makes it faster and lets it handle larger D. We now use this kernel for D <= 64. For 64 < D <= 2048, we have switched from a triangular solve (TRSM) to computing the explicit triangular inverse (TRTRI) followed by a general matrix multiplication (GEMM). Although TRTRI + GEMM performs more operations than TRSM, it is faster overall because GEMM runs much faster than TRSM on the GPU. For D > 2048, we stick with TRSM.

Also, add the -lmwlapack option to buildMexFiles, since calcMahalDistGpu needs to be linked against LAPACK for the triangular inverse.
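To illustrate why the two paths are interchangeable, here is a minimal CPU sketch in plain Python. It is not the code in this commit: the helper names (forward_solve, tri_inverse, matmul, sq_col_norms) and the example data are hypothetical, and it assumes the quantity of interest is the squared column norm of inv(L)*Y, where L is a lower-triangular Cholesky factor of the covariance and the columns of Y are centered data points.

```python
def forward_solve(L, B):
    """Solve L * X = B by forward substitution (the role TRSM plays).
    L is D x D lower triangular, B is D x N; both are lists of rows."""
    D, N = len(L), len(B[0])
    X = [[0.0] * N for _ in range(D)]
    for i in range(D):
        for n in range(N):
            s = B[i][n] - sum(L[i][k] * X[k][n] for k in range(i))
            X[i][n] = s / L[i][i]
    return X

def tri_inverse(L):
    """Explicit inverse of a lower-triangular matrix (the role TRTRI plays),
    obtained here by solving L * X = I."""
    D = len(L)
    identity = [[1.0 if i == j else 0.0 for j in range(D)] for i in range(D)]
    return forward_solve(L, identity)

def matmul(A, B):
    """Plain dense matrix product (the role GEMM plays)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def sq_col_norms(X):
    """Squared Euclidean norm of each column of X."""
    return [sum(X[i][n] ** 2 for i in range(len(X))) for n in range(len(X[0]))]

# Hypothetical example data: D = 3 dimensions, N = 4 points.
L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.5, -1.0, 1.5]]
Y = [[1.0, 0.0, -2.0, 4.0],
     [2.0, 1.0, 0.5, -1.0],
     [0.0, 3.0, 1.0, 2.0]]

# Path 1: triangular solve, then squared norms.
delta_trsm = sq_col_norms(forward_solve(L, Y))
# Path 2: explicit inverse, dense multiply, then squared norms.
delta_gemm = sq_col_norms(matmul(tri_inverse(L), Y))
```

Both paths agree up to rounding. The second does more arithmetic, since it forms the inverse and then multiplies by a matrix whose zero triangle is not exploited, but its bulk is a single dense product, which is the part the message above describes as much faster on the GPU.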
Nabarb · May 1, 2017 · 9b921e3
1 parent 320d62a

Showing 2 changed files with 467 additions and 445 deletions.
diff --git a/@MoDT/buildMexFiles.m b/@MoDT/buildMexFiles.m
@@ -52,6 +52,7 @@ function buildMexFiles()
 % Compiler/linker options
 mexcuda_opts = {
     '-lcublas'                      % Link to cuBLAS
+    '-lmwlapack'                    % Link to LAPACK
     ['NVCCFLAGS="' nvcc_opts '"']
     ['CXXFLAGS="--compiler-options=' compile_opts '"']
     '-L/usr/local/cuda/lib64'       % Location of CUDA libraries