
[AMD] Add buffer support #4716

Open · wants to merge 10 commits into main
Conversation

@giuseros (Contributor) commented Sep 12, 2024

This PR builds on top of #4638 to finally add support for buffer operations. For now we focus on buffer load/store, but we may add more in the future. What this PR does:

  • Adds inferred properties for non-negativity (tt.non_negative) and for the size of the memory buffers passed in (tt.within_2gb)
  • Adds a series of checks to make sure we can emit buffer load instructions (non-negativity, 32-bit offsets, etc.)
  • Changes the pointer canonicalizer pass to take the tt.within_2gb property into account
  • Adds generic infrastructure to emit masked buffer ops; for now we use it to emit masked buffer loads and stores, but we may add more in the future
  • Shields the feature behind an AMDGCN_USE_BUFFER_OPS environment variable, so that we can enable it gradually and check for possible performance/correctness issues (see the sketch after this list)
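For concreteness, here is a minimal sketch of how the gated path would be exercised from user code. The vector-add kernel is illustrative only and not part of this PR; the only name taken from it is the AMDGCN_USE_BUFFER_OPS environment variable.

```python
# Opt in to the buffer-ops path before any kernel is compiled.
# AMDGCN_USE_BUFFER_OPS is the gate added by this PR; the kernel below
# is an ordinary Triton vector add used purely for illustration.
import os
os.environ["AMDGCN_USE_BUFFER_OPS"] = "1"

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n  # masked loads/stores are what the new infra targets
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

n = 10_000  # deliberately non-power-of-two, where the gains reported below show up
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
```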

@giuseros changed the title from "Add buffer support" to "[AMD] Add buffer support" on Sep 12, 2024
@ThomasRaoux (Collaborator) left a comment


just putting a blocker on this as some pieces will need a bit more discussion. Lei had mentioned those to me, so it's not a surprise, but I haven't had a chance to discuss it with Phil and the rest of the team yet.

Review threads (outdated, resolved): python/triton/compiler/compiler.py, python/triton/runtime/jit.py
@ThomasRaoux (Collaborator)

Do we have any data on the performance impact of this feature? Considering the cost in extra compilation and maintenance, it would be good to have this information.

@ThomasRaoux (Collaborator)

After a quick chat with @ptillet, one problem is that the specialization will apply to all backends, even the ones that can't take advantage of it.
@antiagainst @giuseros, can you first make separate changes to refactor the specialization and allow different backends to have different specializations? (A rough sketch of the idea follows.)
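To make the requested refactor concrete, here is a hypothetical sketch of what per-backend specialization could look like. None of the class or field names below come from the repo; only the tt.within_2gb and tt.non_negative property names come from this PR.

```python
# Hypothetical sketch of per-backend specialization: the compiler asks
# the active backend which properties to attach to each kernel argument,
# instead of applying one global specialization rule. All names here are
# illustrative; only the tt.* property names come from this PR.
from dataclasses import dataclass

@dataclass
class KernelArg:
    """Illustrative stand-in for a kernel launch argument."""
    value: object
    is_pointer: bool = False
    buffer_size_bytes: int = 0

class BaseBackend:
    def get_arg_properties(self, arg: KernelArg) -> dict:
        return {}  # default: no backend-specific specialization

class AMDBackend(BaseBackend):
    def get_arg_properties(self, arg: KernelArg) -> dict:
        props = {}
        if arg.is_pointer and arg.buffer_size_bytes < 2**31:
            props["tt.within_2gb"] = True    # buffer ops need a < 2 GB range
        if isinstance(arg.value, int) and arg.value >= 0:
            props["tt.non_negative"] = True  # offsets provably non-negative
        return props

# Usage: the AMD backend opts in; other backends inherit the no-op default.
print(AMDBackend().get_arg_properties(
    KernelArg(value=None, is_pointer=True, buffer_size_bytes=1 << 20)))
```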

@giuseros (Contributor, Author)

> Do we have any data on the performance impact of this feature? Considering the cost in extra compilation and maintenance, it would be good to have this information.

Yes, if we run on a non-power-of-two shape we get up to a 36% improvement:

[image: benchmark results]

These are gfx11 numbers, but I saw the same on MI200 and MI300. For power-of-two shapes the performance is similar.

@giuseros (Contributor, Author) commented Sep 12, 2024

> just putting a blocker on this as some pieces will need a bit more discussion

Absolutely fine; I put it up exactly to have those discussions while we get on with #4638.

@giuseros (Contributor, Author) commented Sep 30, 2024

I rebased against the recent fixes/refactors. I also found that there is a benefit even on power-of-two sizes (still on my gfx11 card):

[image: benchmark results]

This seems to be because buffer ops reduce the number of registers used: we don't have to keep the vector of per-lane pointers around, and only need to update the scalar base pointer (unless there is a non-uniform pointer update within the loop, which is rare). A sketch of the addressing difference follows.
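As an illustration of the register-pressure argument above, here is a minimal Triton-level sketch (my own example, not code from this PR), annotated with how the two lowerings differ:

```python
# Illustrative kernel (not from the PR) showing where the register
# savings come from: the loop loads BLOCK elements per iteration.
import triton
import triton.language as tl

@triton.jit
def strided_sum(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)               # 32-bit, loop-invariant offsets
    acc = tl.zeros((BLOCK,), dtype=tl.float32)
    for i in range(0, n, BLOCK):
        # Without buffer ops: x_ptr + i + offs lowers to a vector of
        # 64-bit per-lane pointers that occupies VGPRs across the loop.
        # With buffer ops: the scalar base (x_ptr + i) advances in scalar
        # registers, and only the 32-bit offs vector stays live per lane.
        mask = i + offs < n
        acc += tl.load(x_ptr + i + offs, mask=mask, other=0.0)
    tl.store(out_ptr + offs, acc)
```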

However, for "bad" configurations (i.e., the ones not picked by the tuner) I sometimes see an increase in register pressure when using buffer ops. So I still think this feature needs to be shielded behind an environment variable to allow further experimentation.
