Now that we have an FPU operator that ought to work on every GPU, we should start thinking about what needs to happen to use GemmKernels.jl in CUDA.jl when CUBLAS isn't available. There are a couple of issues we need to figure out first, so let's keep track of them here:
Support for arbitrary input sizes: the BLAS wrapper needs to select appropriate tile sizes (AFAIU each level's shape needs to be divisible by the lower level's), but it may be better to keep pow2 shapes internally and just mask out-of-bounds global memory reads (this is what CUTLASS does; see the sketch after this list)
Support for arbitrary input types: the FPUOperator currently does not handle mixed-precision combinations, e.g., Float16 × Float32 = Float32
Support for arbitrary input objects, e.g., a Diagonal or ReshapedArray (without having to specialize the implementation)
Automatic selection of the best operator and kernel: WMMA when possible, FPU otherwise (see the selection sketch after this list)
Improve FPUOperator Float16 accuracy: it is really bad compared to CUBLAS (see the example after this list)
(optionally) Some basic (hard-coded) tuning
(optionally) Improved benchmarks that can be run on every commit
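
For the pow2-shapes idea above, here's a minimal sketch of what masked global-memory reads could look like. The tile sizes and the `masked_load` helper are hypothetical illustrations, not GemmKernels.jl API:

```julia
# Pad the problem to full tiles and predicate loads, instead of shrinking tiles.
# TILE_M/TILE_N are hypothetical compile-time tile sizes.
const TILE_M, TILE_N = 64, 64

# Round a dimension up to the next tile multiple, so the kernel only ever
# iterates over complete tiles.
padded(dim, tile) = cld(dim, tile) * tile

# Predicated load: out-of-bounds elements read as zero, so they contribute
# nothing to the accumulator and the result for the valid region is unchanged.
@inline masked_load(A, i, j) =
    (i <= size(A, 1) && j <= size(A, 2)) ? A[i, j] : zero(eltype(A))
```

Stores to C would need the same predication so threads covering the padded region don't write outside the array.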
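For the operator/kernel selection item, a rough sketch of the kind of check the CUDA.jl wrapper could do. `prefer_wmma` is a hypothetical helper, and the exact set of WMMA-supported type combinations listed here is an assumption:

```julia
using CUDA

# Hypothetical heuristic: use WMMA only on devices with tensor cores (sm_70+)
# and for eltype combinations WMMA supports; fall back to the FPU operator
# everywhere else.
function prefer_wmma(a_type, b_type, c_type)
    CUDA.capability(CUDA.device()) >= v"7.0" &&
        a_type === Float16 && b_type === Float16 &&
        c_type in (Float16, Float32)
end
```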
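Regarding the Float16 accuracy item, one plausible explanation (an assumption, not verified against the kernel) is that partial products are accumulated in Float16, while CUBLAS accumulates in Float32. A quick CPU-side illustration of how much the accumulator type matters:

```julia
# Dot product with an explicit accumulator type T.
function dot_acc(::Type{T}, a, b) where {T}
    acc = zero(T)
    for i in eachindex(a, b)
        acc += T(a[i]) * T(b[i])
    end
    return acc
end

a, b = rand(Float16, 4096), rand(Float16, 4096)
@show Float16(dot_acc(Float64, a, b))   # reference, rounded to Float16
@show dot_acc(Float16, a, b)            # Float16 accumulation: visible drift
@show Float16(dot_acc(Float32, a, b))   # Float32 accumulation: matches reference
```

If that is indeed the cause, letting the FPU operator accumulate in a wider compute type (which ties into the mixed-type item above) should close most of the gap.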