Using GemmKernels.jl in CUDA.jl #108

Open · 4 of 8 tasks

maleadt opened this issue Jun 27, 2023 · 0 comments

maleadt (Member) commented Jun 27, 2023
Now that we have an FPU operator that ought to work on every GPU, we should start thinking about what needs to happen to use GemmKernels.jl in CUDA.jl when CUBLAS isn't available. There are a couple of minor issues we need to figure out first, so let's keep track of them here (a rough sketch of the intended dispatch is included after the task list):

  • Support for small inputs: Errors on small array inputs #52
  • Support for arbitrary input sizes: the BLAS wrapper needs to select appropriate tile sizes (AFAIU each level's shape needs to be divisible by the shape of the level below it), but it may be better to keep pow2 shapes internally and just mask out-of-bounds global memory reads, which is what CUTLASS does (see the masking sketch below)
  • Support for arbitrary input types: the FPUOperator currently does not handle mixed-precision combinations such as Float16×Float32=Float32
  • Support for arbitrary input objects, e.g., a Diagonal or ReshapedArray (without having to specialize the implementation)
  • Automatic selection of the best operator and kernel: WMMA when possible, FPU otherwise (see the selection sketch below)
  • Improve FPUOperator Float16 accuracy: it is currently much worse than CUBLAS
  • (optionally) Some basic (hard-coded) tuning
  • (optionally) Improved benchmarks that can be run on every commit
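For context, here is a minimal sketch of what the CUDA.jl side of the dispatch could look like once the above is resolved. It assumes GemmKernels.jl's BLAS wrapper exposes a `gemmEx!`-style entry point; the `cublas_usable()` helper is hypothetical and only stands in for whatever availability check CUDA.jl ends up using:

```julia
using CUDA, LinearAlgebra
using GemmKernels

# Hypothetical availability check; CUDA.jl would know whether CUBLAS can be used.
cublas_usable() = CUDA.functional()  # placeholder condition

function gemm_dispatch!(C::CuMatrix, A::CuMatrix, B::CuMatrix,
                        alpha::Number=true, beta::Number=false)
    if cublas_usable()
        # Fast path: vendor BLAS.
        CUDA.CUBLAS.gemmEx!('N', 'N', alpha, A, B, beta, C)
    else
        # Fallback: pure-Julia GEMM from GemmKernels.jl
        # (assuming a gemmEx!-style wrapper; the exact entry point may differ).
        GemmKernels.BLAS.gemmEx!('N', 'N', alpha, A, B, beta, C)
    end
    return C
end
```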
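On the arbitrary-size point: keeping power-of-two tile shapes internally and masking out-of-bounds global memory reads essentially means predicated loads. A generic illustration of the idea (a toy copy kernel, not GemmKernels.jl's actual kernel structure):

```julia
using CUDA

# Threads outside the logical M×N extent read a zero instead of touching
# out-of-bounds memory; in a GEMM the zero would simply feed the accumulator.
function masked_tile_copy!(dst, src, M, N)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    inside = (i <= M) & (j <= N)
    val = inside ? src[i, j] : zero(eltype(src))  # masked (predicated) read
    if inside
        dst[i, j] = val
    end
    return nothing
end

M, N = 100, 60                          # arbitrary, non-pow2 problem size
Mp, Np = nextpow(2, M), nextpow(2, N)   # pow2 shape used internally
src = CUDA.rand(Float32, M, N); dst = similar(src)
@cuda threads=(16, 16) blocks=(cld(Mp, 16), cld(Np, 16)) masked_tile_copy!(dst, src, M, N)
```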
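And for the operator-selection item, the heuristic is essentially "WMMA if the hardware and element types allow it, FPU otherwise". A sketch of that decision; it returns a plain symbol rather than a concrete operator type, leaving the mapping to GemmKernels.Operator types to the wrapper:

```julia
using CUDA

# Pick an operator family for C = A*B based on hardware and element types.
# WMMA requires a tensor-core-capable GPU (sm_70+) and Float16 inputs;
# everything else falls back to the FPU operator.
function select_operator(::Type{TA}, ::Type{TB}, ::Type{TC}) where {TA,TB,TC}
    cap = CUDA.capability(CUDA.device())
    wmma_types = TA === Float16 && TB === Float16 && TC in (Float16, Float32)
    return (cap >= v"7.0" && wmma_types) ? :WMMA : :FPU
end

select_operator(Float16, Float16, Float32)  # :WMMA on sm_70+ hardware
select_operator(Float32, Float32, Float32)  # :FPU
```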