[AMD] Define an extract slice operation #4804
Conversation
Thanks! I have a bunch of inlined comments. Major issues include refining the semantics of the op and adding more docs/tests for it.
Thanks for meticulously going through the code changes! I have a new commit addressing the comments, and I have also moved the Python test case to the newly proposed location.
Thanks, much better!
Force-pushed from f93367b to 6a23263.
Great! Some final comments inlined. Also, could you git merge origin/main so that I can trigger CI? Right now I cannot, because the branch is not on the latest main.
@hmalgewatta thanks for working on this, and @antiagainst thanks for the detailed review and feedback. Looks great now :)
Hi @antiagainst, in the most recent commit I renamed the test file, added code checking for non-static cases, added more lit tests for failing non-static cases, and made changes to avoid conflicts. I've also synced my fork with the main branch so that you can trigger CI.
Force-pushed from 931cd5c to 5897e00.
Please git merge origin/main again and resolve the conflicts.
Force-pushed from 393a6f0 to 6be6f71.
The design of the op is significantly different from what I would have imagined, so maybe I'm missing some context. Let me know if my comment makes sense; otherwise we may need to discuss this a bit more.
Force-pushed from da4e954 to 6787e58.
Location loc = op->getLoc();
auto srcTy = cast<RankedTensorType>(op.getSource().getType());
auto srcLayout = srcTy.getEncoding();
auto srcShape = srcTy.getShape();
auto resultTy = cast<RankedTensorType>(op.getType());
auto vals = unpackLLElements(loc, adaptor.getSource(), rewriter);
auto elemsPerThread = triton::gpu::getElemsPerThread(srcTy);
auto sizePerThread = triton::gpu::getSizePerThread(srcLayout);
auto totalSizePerThread = sizePerThread[0] * sizePerThread[1];
auto order = triton::gpu::getOrder(srcLayout);

// Calculate valid total number of workers in each dimension
auto shapePerCTA = triton::gpu::getShapePerCTATile(srcLayout, srcShape);
shapePerCTA[0] =
    std::min(static_cast<unsigned>(srcShape[0]), shapePerCTA[0]);
shapePerCTA[1] =
    std::min(static_cast<unsigned>(srcShape[1]), shapePerCTA[1]);

// Rank == 2 checked in the verifier
SmallVector<int64_t, 2> sizes;
for (auto i = 0; i < 2; ++i) {
  sizes.push_back(resultTy.getDimSize(i));
}

auto offsets = op.getStaticOffsets();

// Calculate offsets and sizes in terms of CTA units.
std::vector<int64_t> CTAOffsets{offsets[0] / shapePerCTA[0],
                                offsets[1] / shapePerCTA[1]};
std::vector<int64_t> CTASizes{sizes[0] / shapePerCTA[0],
                              sizes[1] / shapePerCTA[1]};
std::vector<int64_t> CTAPerShape{srcShape[0] / shapePerCTA[0],
                                 srcShape[1] / shapePerCTA[1]};

// The diagram above illustrates the graphical representation of the
// skipElems, tensorStride, and lastIdx variables.
auto skipElems = CTAOffsets[order[1]] *
                     (elemsPerThread[order[0]] * sizePerThread[order[1]]) +
                 CTAOffsets[order[0]] * totalSizePerThread;
auto tensorStride =
    (CTAPerShape[order[0]] - CTASizes[order[0]]) * totalSizePerThread;
auto lastIdx =
    (CTAOffsets[order[1]] + CTASizes[order[1]] - 1) *
        elemsPerThread[order[0]] * sizePerThread[order[1]] +
    (CTAOffsets[order[0]] + CTASizes[order[0]]) * totalSizePerThread;

assert(lastIdx <= vals.size());

SmallVector<Value> resultVals;
for (int i = skipElems; i < lastIdx; i += tensorStride) {
  for (int j = 0; j < totalSizePerThread * CTASizes[order[0]]; ++j, ++i) {
    assert(i < lastIdx);
    resultVals.push_back(vals[i]);
  }
}
Value ret = packLLElements(loc, this->getTypeConverter(), resultVals,
                           rewriter, resultTy);

rewriter.replaceOp(op, ret);
return success();
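To make the indexing arithmetic concrete, here is a worked illustration with hypothetical layout numbers (not taken from the PR, only to show how skipElems, tensorStride, and lastIdx relate):

// Hypothetical blocked layout: 128x128 source, 32x32 CTA tile, so
// CTAPerShape = {4, 4}; sizePerThread = {4, 4}, totalSizePerThread = 16;
// order = {1, 0}; elemsPerThread = {16, 16} (256 values per thread).
// Slicing with offsets = {0, 64} and sizes = {128, 64} gives
// CTAOffsets = {0, 2} and CTASizes = {4, 2}, hence:
//   skipElems    = 0 * (16 * 4) + 2 * 16               = 32
//   tensorStride = (4 - 2) * 16                        = 32
//   lastIdx      = (0 + 4 - 1) * 16 * 4 + (2 + 2) * 16 = 256
// The copy loop then alternately takes a run of 32 values (two CTA tiles
// along the fastest dimension) and skips 32, keeping half of the 256
// per-thread values, as expected for a half-width slice.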
This lowering makes assumptions on the layout and shape of the operands/destination that are stronger than what is in the verifier, right?
Where do we check that those hold? It is okay to fail lowering if we don't want to support some cases, but we never want to miscompile.
Good catch! +1. We should check that each thread is still holding the same elements in the op verifier.
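For illustration, a minimal sketch of the kind of structural check being discussed, using hypothetical variable names; the PR's actual verifier may differ:

// Inside the op verifier: reject slices that do not fall on CTA-tile
// boundaries, since lowering them would require exchanging values
// between threads and could otherwise silently miscompile.
for (int d = 0; d < 2; ++d) {
  if (offsets[d] % shapePerCTATile[d] != 0 ||
      sizes[d] % shapePerCTATile[d] != 0)
    return emitError() << "offsets and sizes must be multiples of the "
                       << "shape per CTA tile in dimension " << d;
}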
@antiagainst the additional constraint I've identified to add to the verifier is making sure the divisors are not zero. I think moving the assert (line 107) would create duplicated code in the verifier, but I can move it there if that's the preference. I'm stuck on how to check in the verifier that the elements each thread is holding stay the same. Could you point me to how this could be done? And also let me know if there are any other checks I might have missed.
Sorry for the late reply; I got distracted previously. After addressing #4804 (comment), it looks like the current logic should be enough to guarantee that after slicing, threads are handling slices of the original elements without exchange/duplication. So this reads fine to me now @ThomasRaoux. Let me know if you still think some parts are missing.
auto offsets = op.getStaticOffsets();

// Calculate offsets and sizes in terms of CTA units.
std::vector<int64_t> CTAOffsets{offsets[0] / shapePerCTA[0],
Prefer not to mix std::vector with SmallVector. Just using std::array<int64_t, 2> works here.
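For illustration, a minimal sketch of that suggestion applied to the snippet above (assuming the element type stays int64_t):

// Fixed-size stack arrays instead of heap-allocating std::vector;
// the rank is known to be 2 from the verifier.
std::array<int64_t, 2> CTAOffsets{offsets[0] / shapePerCTA[0],
                                  offsets[1] / shapePerCTA[1]};
std::array<int64_t, 2> CTASizes{sizes[0] / shapePerCTA[0],
                                sizes[1] / shapePerCTA[1]};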
matchAndRewrite(amdgpu::ExtractSliceOp op, OpAdaptor adaptor,
                ConversionPatternRewriter &rewriter) const override {
  auto srcTy = op.getSource().getType();
  if (isa<BlockedEncodingAttr>(op.getSource().getType().getEncoding()) ||
isa can support multiple attributes in <...>.
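A sketch of that suggestion; the second encoding attribute here is hypothetical, just to show the form:

// A single isa<> with several candidate types replaces a chain of ||:
if (isa<BlockedEncodingAttr, AMDMfmaEncodingAttr>(srcTy.getEncoding())) {
  // ... handle the supported encodings ...
}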
So this reads fine to me now @ThomasRaoux. Let me know if you still think some parts are missing.
Yes, I think this is fine. I got thrown off by some of the naming in the verifier, but I think it is correct.
I approved, but please address the remaining points before merging.
Force-pushed from fe79ae9 to f7d04fa.
}

auto srcShape = srcTy.getShape();
auto shapePerCTA = mlir::triton::gpu::getShapePerCTATile(srcLayout, srcShape);
I thought I commented on this but I don't see it anymore; sorry if it is a duplicate. Can we find a better name for this variable? In the rest of the code, shapePerCTA has a very different meaning: shapePerCTA means the sub-tensor owned by a CTA, whereas here you are only getting one tile of the shape per CTA.
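A sketch of the kind of rename being asked for; the new variable name is only an illustration:

// "Tile" in the name distinguishes one CTA tile from the full sub-tensor
// owned by a CTA, which is what shapePerCTA means elsewhere in the code.
auto shapePerCTATile =
    mlir::triton::gpu::getShapePerCTATile(srcLayout, srcShape);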
Force-pushed from f7d04fa to dda5813.
Sorry, I pushed a version that was not well merged. I'll correct this.
Introduces a new operation for AMD GPUs to slice a tensor in memory:
- Adds new TritonAMDGPUDialect operation ViewSliceOp
- Adds a verifier for ViewSliceOp
- Adds conversion of the operation to LLVM
Force-pushed from 560b261 to 36c425f.
This commit introduces an extract_slice operation for the AMD backend
to enable viewing a slice of a tensor in registers without data exchange.
It enables breaking down large tiles of tensors into smaller ones
for better instruction interleaving and scheduling.
This can be useful for hiding global memory latency when a global
load/store can be efficiently split into several loads/stores to be
overlapped with compute, for example in attention.
- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following:
  - I have added tests (/test for lit tests, /unittest for C++ tests, /python/test for end-to-end tests).
  - This PR does not need a test because FILL THIS IN.
- Select one of the following:
  - I have not added any lit tests.
  - The lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)