triton_vs_cuda

Building Triton and CUDA kernels side-by-side to create a cuBLAS-performant GEMM kernel.

Lately I've been learning Triton, its strengths, and its weaknesses. Inspired by siboehm's blog, I want to show how far we can push a Triton kernel toward matching a near-cuBLAS-performant CUDA kernel. In this endeavor I hope to highlight a few things about Triton:

  • what are the limitations of Triton's block-level programming paradigm?
  • as kernel engineers, how much control do we retain in Triton to squeeze out more performance?
  • where does the Triton compiler take over and attempt to fill in? How successful is it at this task? Where is work still needed at the compiler level?
  • when should you actually use Triton vs. CUDA?
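To make the block-level paradigm concrete, here is a minimal NumPy sketch (my own illustration, not the repo's actual kernel) of the tiling scheme a Triton GEMM program expresses: each "program" in the launch grid owns one output tile of C and accumulates over K in chunks. The tile sizes `BLOCK_M`, `BLOCK_N`, and `BLOCK_K` are placeholder values, not tuned choices.

```python
import numpy as np

def blocked_gemm(A, B, BLOCK_M=32, BLOCK_N=32, BLOCK_K=16):
    """Block-tiled C = A @ B, mirroring Triton's programming model:
    one (pid_m, pid_n) "program" per BLOCK_M x BLOCK_N tile of C."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    # The two outer loops stand in for Triton's launch grid of programs.
    for m0 in range(0, M, BLOCK_M):
        for n0 in range(0, N, BLOCK_N):
            # Accumulator for this program's output tile (NumPy slicing
            # clips at the edges, playing the role of Triton's load masks).
            acc = np.zeros((min(BLOCK_M, M - m0), min(BLOCK_N, N - n0)),
                           dtype=A.dtype)
            # Inner K loop: load one tile of A and one of B, accumulate.
            for k0 in range(0, K, BLOCK_K):
                a = A[m0:m0 + BLOCK_M, k0:k0 + BLOCK_K]
                b = B[k0:k0 + BLOCK_K, n0:n0 + BLOCK_N]
                acc += a @ b
            C[m0:m0 + BLOCK_M, n0:n0 + BLOCK_N] = acc
    return C
```

In a real Triton kernel the two outer loops disappear: the compiler maps each (pid_m, pid_n) pair to a CTA, and decisions like how the tile is staged through shared memory are left to the compiler rather than the programmer, which is exactly the trade-off the questions above probe.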

Getting Started

I've divided this project into two branches:

  • main: template kernel files
  • solutions: solution kernel files

I've included dockerfiles in the /triton and /cuda directories to make environment setup quick and easy. Open those directories and you'll find README.md files explaining how to get going.

In Progress

I'll have a blog post on the subject published at some point on my personal website: alexkranias.com

I'm actively working on that piece.

In the meantime, you can clone this repo to work through the kernels on your own and follow siboehm's blog.
