Now that we have an FPU operator that ought to work on every GPU, we should start thinking about what needs to happen to use GemmKernels.jl in CUDA.jl when CUBLAS isn't available. There are a couple of issues we need to figure out first, so let's keep track of them here:
Support for arbitrary input sizes: the BLAS wrapper needs to select appropriate tile sizes (AFAIU each level's shape needs to be divisible by the lower level's), but it may be better to keep pow2 shapes internally and just mask out-of-bounds global memory reads (this is what CUTLASS does; see the sketch after this list)
Support for arbitrary input types: the FPUOperator currently does not handle mixed-precision combinations, e.g., Float16 × Float32 = Float32
Support for arbitrary input objects, e.g., a Diagonal or ReshapedArray (without having to specialize the implementation)
Automatic selection of the best operator and kernel: WMMA when possible, FPU otherwise (see the selection sketch after this list)
Improve FPUOperator Float16 accuracy: it is really bad compared to CUBLAS (see the example after this list)
(optionally) Some basic (hard-coded) tuning
(optionally) Improved benchmarks that can be run on every commit
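
For the pow2-shapes idea above, here's a minimal sketch of what masked global-memory reads could look like. The tile sizes and the `masked_load` helper are hypothetical illustrations, not GemmKernels.jl API:

```julia
# Pad the problem to full tiles and predicate loads, instead of shrinking tiles.
# TILE_M/TILE_N are hypothetical compile-time tile sizes.
const TILE_M, TILE_N = 64, 64

# Round a dimension up to the next tile multiple, so the kernel only ever
# iterates over complete tiles.
padded(dim, tile) = cld(dim, tile) * tile

# Predicated load: out-of-bounds elements read as zero, so they contribute
# nothing to the accumulator and the result for the valid region is unchanged.
@inline masked_load(A, i, j) =
    (i <= size(A, 1) && j <= size(A, 2)) ? A[i, j] : zero(eltype(A))
```

Stores to C would need the same predication so threads covering the padded region don't write outside the array.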
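For the operator/kernel selection item, a rough sketch of the kind of check the CUDA.jl wrapper could do. `prefer_wmma` is a hypothetical helper, and the exact set of WMMA-supported type combinations listed here is an assumption:

```julia
using CUDA

# Hypothetical heuristic: use WMMA only on devices with tensor cores (sm_70+)
# and for eltype combinations WMMA supports; fall back to the FPU operator
# everywhere else.
function prefer_wmma(a_type, b_type, c_type)
    CUDA.capability(CUDA.device()) >= v"7.0" &&
        a_type === Float16 && b_type === Float16 &&
        c_type in (Float16, Float32)
end
```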
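Regarding the Float16 accuracy item, one plausible explanation (an assumption, not verified against the kernel) is that partial products are accumulated in Float16, while CUBLAS accumulates in Float32. A quick CPU-side illustration of how much the accumulator type matters:

```julia
# Dot product with an explicit accumulator type T.
function dot_acc(::Type{T}, a, b) where {T}
    acc = zero(T)
    for i in eachindex(a, b)
        acc += T(a[i]) * T(b[i])
    end
    return acc
end

a, b = rand(Float16, 4096), rand(Float16, 4096)
@show Float16(dot_acc(Float64, a, b))   # reference, rounded to Float16
@show dot_acc(Float16, a, b)            # Float16 accumulation: visible drift
@show Float16(dot_acc(Float32, a, b))   # Float32 accumulation: matches reference
```

If that is indeed the cause, letting the FPU operator accumulate in a wider compute type (which ties into the mixed-type item above) should close most of the gap.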