
Optimizing Shared Memory Usage #4756

Open
karan-dalal opened this issue Sep 18, 2024 · 1 comment

karan-dalal commented Sep 18, 2024

Hi. I'm writing a Triton kernel for the backward pass of a sub-quadratic attention architecture. Currently, I'm receiving the following error when compiling the kernel:

```
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 167936, Hardware limit: 166912. Reducing block sizes or `num_stages` may help.
```
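
For reference on what those knobs look like: below is only a placeholder sketch (the kernel name, arguments, and values are illustrative, not my actual code), showing where block sizes and `num_stages` are typically set.

```python
import triton
import triton.language as tl

# Placeholder kernel: BLOCK_N is a tl.constexpr tile size, and num_stages /
# num_warps are set per config, so this is where the knobs from the error
# message ("reduce block sizes or num_stages") would be turned down.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_N": 64}, num_stages=2, num_warps=4),
        triton.Config({"BLOCK_N": 32}, num_stages=1, num_warps=4),
    ],
    key=["seq_len"],
)
@triton.jit
def attn_bwd_kernel(q_ptr, k_ptr, v_ptr, do_ptr, dq_ptr,
                    seq_len, BLOCK_N: tl.constexpr):
    pass  # body omitted; only the knob placement matters here
```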

The operations in the kernel are complex, with many loads and intermediate variables created during the derivation. I have a few questions about SRAM usage inside the kernel:

  • Does the order of `tl.load` matter, or is Triton smart enough to compile it into the most memory-optimal form? That is, can I `tl.load` all required variables at the beginning and expect the same memory usage as if I loaded each one right before the operation it is involved in? (See the sketch after this list.)
  • Is there a way to forcibly evict a variable from shared memory after loading it, if I no longer need to use it?
  • If I use `tl.store` and `tl.load` in the same kernel, will this force Triton to write the value out to HBM and then reload it from HBM?
  • If I load `x1 = tl.load(ptr)` and then later load another variable into it with `x1 = tl.load(ptr2)`, will this overwrite the memory in SRAM? (Also sketched below.)
  • Is there a way to understand memory usage breakdown in a compiled kernel?
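
To make the first and fourth bullets concrete, here is a stripped-down sketch of the two patterns I mean; the pointer names and block size are placeholders rather than my real kernel:

```python
import triton
import triton.language as tl

@triton.jit
def ordering_sketch(x_ptr, y_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)

    # Pattern A (question 1): load everything up front, long before use.
    a = tl.load(x_ptr + offs)
    b = tl.load(y_ptr + offs)
    acc = a * 2.0
    acc += b

    # Pattern B (question 4): reuse the same Python name for a later load --
    # does the old value become dead so its on-chip storage can be reclaimed?
    a = tl.load(y_ptr + offs)
    acc += a

    tl.store(out_ptr + offs, acc)
```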

Note: I'm using a simple grid of shape [Batch, Heads] (like Flash Attention). I don't think block sizes or `num_stages` are relevant here.
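
For completeness, the launch shape I mean is just this (the wrapper name, tensor layout, and launch line are assumptions for illustration):

```python
import torch

def bwd_grid(q: torch.Tensor):
    # Assumed layout: q is [batch, heads, seq_len, head_dim].
    batch, heads, seq_len, head_dim = q.shape
    # One program per (batch, head) pair, as in Flash Attention; each program
    # walks the full sequence internally, so there is no block-sized grid axis.
    return (batch, heads)

# Hypothetical launch, reusing the placeholder kernel sketched above:
# attn_bwd_kernel[bwd_grid(q)](q, k, v, do, dq, seq_len)
```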

I'm also happy to share the kernel code, if needed. Hopefully there's some way I can rearrange operations and evict values from SRAM to optimize usage.

@Li-dongyang

Any progress? I'm quite interested in this shared memory usage issue. Would it be possible for you to share your code?
