
[AMD] Add buffer support #4716

Open · wants to merge 10 commits into main
Conversation

@giuseros (Contributor) commented Sep 12, 2024

This PR builds on top of #4638 to finally add support for buffer operations. For now we focus on buffer load/store, but we may add more in the future. What this PR does:

  • Adds inferred properties for non-negativity (tt.non_negative) and for the size of the memory buffers passed in (tt.within_2gb)
  • Adds a series of checks to make sure we can emit buffer load instructions (non-negativity, 32-bit offsets, etc.)
  • Changes the pointer canonicalizer pass to take the tt.within_2gb property into account
  • Adds generic infrastructure to emit masked buffer ops; for now we use it to emit masked buffer loads and stores, but we may add more in the future
  • Shields the feature behind an AMDGCN_USE_BUFFER_OPS environment variable, so that we can enable it gradually and check for possible performance/correctness issues (see the sketch after this list)
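For concreteness, here is a minimal sketch of how the gated path would be exercised from user code. The vector-add kernel is illustrative only and not part of this PR; the only name taken from it is the AMDGCN_USE_BUFFER_OPS environment variable.

```python
# Opt in to the buffer-ops path before any kernel is compiled.
# AMDGCN_USE_BUFFER_OPS is the gate added by this PR; the kernel below
# is an ordinary Triton vector add used purely for illustration.
import os
os.environ["AMDGCN_USE_BUFFER_OPS"] = "1"

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n  # masked loads/stores are what the new infra targets
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

n = 10_000  # deliberately non-power-of-two, where the gains reported below show up
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
```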

@giuseros changed the title from "Add buffer support" to "[AMD] Add buffer support" on Sep 12, 2024
@ThomasRaoux (Collaborator) left a comment


just putting a blocker on this as some pieces will need a bit more discussion. Lei had mentioned those to me, so it's not a surprise, but I haven't had a chance to discuss it with Phil and the rest of the team yet.

Review threads (outdated, resolved): python/triton/compiler/compiler.py, python/triton/runtime/jit.py
@ThomasRaoux (Collaborator)

Do we have any data on the performance impact of this feature? Considering the cost in extra compilation and maintenance, it would be good to have this information.

@ThomasRaoux (Collaborator)

After a quick chat with @ptillet, one problem is that the specialization will apply to all backends, even the ones that can't take advantage of it.
@antiagainst @giuseros, can you first make separate changes to refactor the specialization and allow different backends to have different specializations? (A rough sketch of the idea follows.)
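To make the requested refactor concrete, here is a hypothetical sketch of what per-backend specialization could look like. None of the class or field names below come from the repo; only the tt.within_2gb and tt.non_negative property names come from this PR.

```python
# Hypothetical sketch of per-backend specialization: the compiler asks
# the active backend which properties to attach to each kernel argument,
# instead of applying one global specialization rule. All names here are
# illustrative; only the tt.* property names come from this PR.
from dataclasses import dataclass

@dataclass
class KernelArg:
    """Illustrative stand-in for a kernel launch argument."""
    value: object
    is_pointer: bool = False
    buffer_size_bytes: int = 0

class BaseBackend:
    def get_arg_properties(self, arg: KernelArg) -> dict:
        return {}  # default: no backend-specific specialization

class AMDBackend(BaseBackend):
    def get_arg_properties(self, arg: KernelArg) -> dict:
        props = {}
        if arg.is_pointer and arg.buffer_size_bytes < 2**31:
            props["tt.within_2gb"] = True    # buffer ops need a < 2 GB range
        if isinstance(arg.value, int) and arg.value >= 0:
            props["tt.non_negative"] = True  # offsets provably non-negative
        return props

# Usage: the AMD backend opts in; other backends inherit the no-op default.
print(AMDBackend().get_arg_properties(
    KernelArg(value=None, is_pointer=True, buffer_size_bytes=1 << 20)))
```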

@giuseros (Contributor, Author)

> Do we have any data on the performance impact of this feature? Considering the cost in extra compilation and maintenance, it would be good to have this information.

Yes, if we run on a non-power-of-two shape we get up to a 36% improvement:

[image: benchmark results]

These are gfx11 numbers, but I saw the same on MI200 and MI300. For power-of-two shapes the performance is similar.

@giuseros (Contributor, Author) commented Sep 12, 2024

> just putting a blocker on this as some pieces will need a bit more discussion

Absolutely fine; I put it up exactly to have those discussions while we get on with #4638.

@giuseros (Contributor, Author) commented Sep 30, 2024

I rebased against the recent fixes/refactors. I also found that there is a benefit even on power-of-two sizes (still on my gfx11 card):

[image: benchmark results]

This seems to be because buffer ops reduce the number of registers used: we don't have to keep the vector of per-lane pointers around, and only need to update the scalar base pointer (unless there is a non-uniform pointer update within the loop, which is rare). A sketch of the addressing difference follows.
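As an illustration of the register-pressure argument above, here is a minimal Triton-level sketch (my own example, not code from this PR), annotated with how the two lowerings differ:

```python
# Illustrative kernel (not from the PR) showing where the register
# savings come from: the loop loads BLOCK elements per iteration.
import triton
import triton.language as tl

@triton.jit
def strided_sum(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)               # 32-bit, loop-invariant offsets
    acc = tl.zeros((BLOCK,), dtype=tl.float32)
    for i in range(0, n, BLOCK):
        # Without buffer ops: x_ptr + i + offs lowers to a vector of
        # 64-bit per-lane pointers that occupies VGPRs across the loop.
        # With buffer ops: the scalar base (x_ptr + i) advances in scalar
        # registers, and only the 32-bit offs vector stays live per lane.
        mask = i + offs < n
        acc += tl.load(x_ptr + i + offs, mask=mask, other=0.0)
    tl.store(out_ptr + offs, acc)
```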

However, for "bad" configurations (i.e., the ones not picked by the tuner) I sometimes see an increase in register pressure when using buffer ops. So I still think this feature needs to be shielded behind an environment variable to allow further experimentation.
