
Equivalent of cuSparse - start with sparse matvec #208

Open
ViralBShah opened this issue Jun 19, 2023 · 10 comments
Labels
enhancement New feature or request libraries Things about libraries and how we use them.

Comments

@ViralBShah
Contributor

I believe there is currently no sparse matrix capability in Metal.jl. What is the easiest way to get some basic things working?

Perhaps a bigger question is whether we can have a generic sparse matrix implementation that can work on all our GPU backends.

@maleadt
Member

maleadt commented Jun 19, 2023

Metal's performance shaders library does not support sparse arrays. Apple Accelerate does, but that's for the CPU. Maybe that's good enough, though (with the memory being unified anyway)?

A generic implementation would be nice, but I don't have much experience with sparse algorithms. What operations would be important?

@ViralBShah
Contributor Author

There is a FixedSparseCSC type that would be a good starting point. An initial feature set might include:

  1. A COO matrix format
  2. The ability to convert back and forth from FixedSparseCSC
  3. Matrix-vector and matrix-matrix multiplication kernels
  4. Broadcast
  5. Reductions and scans
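To make point 1 and 2 concrete, here is a minimal sketch of what a COO container and the CSC round-trip could look like. The `COOMatrix` name and layout are hypothetical (nothing like this exists in Metal.jl yet); the conversions lean on the `SparseArrays` stdlib:

```julia
using SparseArrays

# Hypothetical COO container; the name and fields are illustrative only.
struct COOMatrix{Tv,Ti<:Integer}
    m::Int
    n::Int
    rows::Vector{Ti}
    cols::Vector{Ti}
    vals::Vector{Tv}
end

# CSC -> COO: expand each column's pointer range into explicit column indices.
function COOMatrix(A::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}
    rows = rowvals(A)
    vals = nonzeros(A)
    cols = similar(rows)
    for j in 1:size(A, 2), k in nzrange(A, j)
        cols[k] = j
    end
    COOMatrix{Tv,Ti}(size(A)..., copy(rows), cols, copy(vals))
end

# COO -> CSC, reusing the stdlib `sparse` constructor.
tocsc(C::COOMatrix) = sparse(C.rows, C.cols, C.vals, C.m, C.n)
```

On a GPU backend the three vectors would become device arrays, but the index layout is identical.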

@maleadt
Member

maleadt commented Jun 22, 2023

There is a FixedSparseCSC type that would be a good starting point.

What makes FixedSparseCSC (as opposed to the normal CSC array type) better suited for GPU acceleration? Generally the problem is how to parallelize (as there are fewer opportunities; naively, only over a single dimension).

@ViralBShah
Contributor Author

ViralBShah commented Jun 22, 2023

I guess I am trying to figure out what the right programming model to keep in mind here would be. Getting a fast sparse matvec (and getting conjugate gradient working), followed by a fast matmul, would be a good starting point to explore what is possible.

I'll experiment with a few things and see how far I can get.
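To illustrate why the matvec alone gets you so far: a textbook conjugate gradient only ever touches the matrix through `A * p`. Here is a minimal sketch (named `cg_sketch` to avoid shadowing IterativeSolvers' `cg`); everything else is vector arithmetic that GPUArrays already provides:

```julia
using LinearAlgebra, SparseArrays

# Textbook CG for a symmetric positive-definite A. The only operation that
# involves A is the matvec A * p, so a fast sparse matvec unlocks the solver.
function cg_sketch(A, b; tol=1e-8, maxiter=length(b))
    x = zero(b)
    r = copy(b)       # residual b - A*x (x starts at zero)
    p = copy(r)       # search direction
    rs = dot(r, r)
    for _ in 1:maxiter
        Ap = A * p                    # the one sparse operation
        α = rs / dot(p, Ap)
        x .+= α .* p
        r .-= α .* Ap
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    x
end
```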

@maleadt
Member

maleadt commented Jun 22, 2023

There are some native kernels I wrote in CUDA.jl, https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusparse/broadcast.jl, which use row/column iterators that 'zip' the multiple inputs. Thus, they parallelize across one dimension of the sparse input.

Multiplication is much more difficult though, as there isn't a straightforward dimension to accelerate over. (The crux of the issue is that we cannot have an efficient getindex to use on sparse inputs in a kernel; instead, we need threads to iterate over the row/column values.)
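For context on why random access is the problem: in CSC, looking up `A[i, j]` means binary-searching the stored rows of column `j`, while iterating a column touches each stored entry exactly once. A sketch of what getindex costs (essentially what the stdlib does internally):

```julia
using SparseArrays

# Random access into CSC: a binary search within the column's stored rows.
# Doing this per entry inside a matmul kernel would be far slower than having
# threads walk nzrange(A, j) directly.
function getindex_csc(A::SparseMatrixCSC, i::Integer, j::Integer)
    rows = rowvals(A)
    vals = nonzeros(A)
    r = nzrange(A, j)
    k = searchsortedfirst(rows, i, first(r), last(r), Base.Order.Forward)
    (k <= last(r) && rows[k] == i) ? vals[k] : zero(eltype(A))
end
```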

@maleadt
Member

maleadt commented Jun 22, 2023

Also note that those CUDA.jl kernels are ideally suited to be ported to GPUArrays using KA.jl, once we start doing that, as they don't use any advanced CUDA features.

@ViralBShah ViralBShah changed the title Equivalent of cuSparse Equivalent of cuSparse - start with sparse matvec Jul 29, 2023
@ViralBShah
Contributor Author

ViralBShah commented Jul 29, 2023

From IterativeSolvers, cg(a,b) works when a is a dense matrix and b is a dense vector. Having a working cg for a sparse matrix would be interesting, since it would open the door to various iterative solvers on GPUs with little effort. The main operation needed for that is a sparse matvec.

After that, a sparse matmul is valuable.

Since CUDA uses CSR, perhaps we could just use that for Metal.jl as well.
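One cheap way to get CSR arrays from Julia's native CSC matrices, without writing a dedicated converter: the CSC representation of `transpose(A)` has exactly the index structure of CSR(A). A sketch (field access on `colptr` assumes the current `SparseMatrixCSC` layout):

```julia
using SparseArrays

# CSR(A) == CSC(Aᵀ): materialize the transpose in CSC form and reinterpret
# its arrays as (rowptr, colval, nzval).
function csc_to_csr(A::SparseMatrixCSC)
    At = sparse(transpose(A))
    At.colptr, rowvals(At), nonzeros(At)
end
```

The transpose costs O(nnz) time and memory, which is acceptable as a one-time upload step.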

@maleadt maleadt added enhancement New feature or request libraries Things about libraries and how we use them. labels Feb 28, 2024
@jamblejoe

I am a bit lost on how to start working on such an implementation. Should I use the kernel programming capabilities of Metal.jl, or use KernelAbstractions.jl? In the latter case, where should the functionality live?

@maleadt
Member

maleadt commented Jul 22, 2024

Implementing this in GPUArrays.jl using KernelAbstractions.jl (starting from JuliaGPU/GPUArrays.jl#525) is probably the best approach. One complication is where and how to define the concrete sparse array types that will be needed. I guess we will need some new type hierarchy for host and device sparse arrays backed by GPU memory, which packages like CUDA.jl can then provide concrete implementations for (but with well-defined interfaces, such that generic kernels defined in GPUArrays.jl can operate on them).
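As a rough illustration of that split (all names here are hypothetical, not an existing GPUArrays.jl API): a host-side abstract type that backends subtype, plus a plain "device view" struct whose fields are the raw index/value arrays, so generic kernels can iterate the buffers directly:

```julia
using SparseArrays

# Hypothetical host-side supertype; CUDA.jl, Metal.jl, etc. would subtype it
# with their own storage.
abstract type AbstractGPUSparseMatrixCSR{Tv,Ti} <:
              SparseArrays.AbstractSparseMatrix{Tv,Ti} end

# Hypothetical device-side view: a plain struct over the backing arrays,
# which is what a generic KernelAbstractions.jl kernel would receive.
struct GPUSparseDeviceMatrixCSR{Tv,Ti,V<:AbstractVector{Tv},I<:AbstractVector{Ti}}
    m::Int
    n::Int
    rowptr::I
    colval::I
    nzval::V
end

Base.size(A::GPUSparseDeviceMatrixCSR) = (A.m, A.n)
```

The interface contract would then be: any kernel may read `rowptr`, `colval`, and `nzval`, regardless of which backend allocated them.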

@jamblejoe

After diving a bit into approaches for parallelizing sparse matrix-vector multiplication (spmv) and sparse matrix-matrix multiplication (spmm), I realized that this is a far more complicated (and rich) field than I thought last week.

Julia's standard format for sparse matrices is CSC. Naively parallelizing the serial matrix-vector multiplication algorithm

```julia
# y = A * x; A = SparseMatrixCSC(m, n, colptr, rowval, nzval); one thread per column
function kernel(y, colptr, rowval, nzval, x)
    i = thread_position_in_grid_1d()
    x_i = x[i]
    for j in colptr[i]:(colptr[i+1]-1)
        y[rowval[j]] += nzval[j] * x_i  # racy: threads from different columns hit the same y[rowval[j]]
    end
    return
end
```

results in a race condition due to simultaneous writes into y (y[rowval[j]]). As a simple fix, I would suggest implementing the CSR format first. At least this gives the correct result:

```julia
# y = A * x with A in CSR form (rowptr, colval, nzval); one thread per row
function kernel(y, rowptr, colval, nzval, x)
    i = thread_position_in_grid_1d()
    for j in rowptr[i]:(rowptr[i+1]-1)
        y[i] += nzval[j] * x[colval[j]]  # each thread owns y[i], so no race
    end
    return
end
```

Of course, this is not optimal, as the accesses to x are non-contiguous. There are several proposed solutions; some are reported in a 2008 paper by NVIDIA (link), but newer and more sophisticated ones are available. That paper also presents spmv algorithms for several other sparse matrix formats, but not CSC. A quick search for highly parallel CSC spmv did not turn up anything useful. I stumbled across the MKL take on CSC spmv and found that, apparently, Intel did not bother parallelizing it (see here).

I tried to check what NVIDIA does with the CSC format. I do not currently have direct access to an NVIDIA device, so I tried Google Colab. I could not find any significant runtime differences between CSC and CSR spmv (see this gist). So NVIDIA has apparently found a good solution for spmv in the CSC format.
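For reference, the per-row loop body of the CSR kernel above can be validated on the CPU against the stdlib: the serial function below is exactly what each GPU thread would execute for its row `i`, and since rows are independent there is nothing to race on. The CSR arrays here are obtained via the transpose trick (CSC of Aᵀ is CSR of A):

```julia
using SparseArrays

# Serial CPU analogue of the CSR kernel: the inner loop per row i matches the
# per-thread work of the GPU version; rows are independent.
function spmv_csr!(y, rowptr, colval, nzval, x)
    for i in eachindex(y)
        acc = zero(eltype(y))
        for j in rowptr[i]:(rowptr[i+1]-1)
            acc += nzval[j] * x[colval[j]]
        end
        y[i] = acc
    end
    y
end
```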
