
[AMD] Count llvm instruction during conversion for scheduling hints #4819

Merged
5 commits merged into triton-lang:main from ravil/sched-barriers-stat on Oct 13, 2024

Conversation

ravil-mobile
Contributor

@ravil-mobile ravil-mobile commented Sep 27, 2024

Advanced software pipelining may require fine-grained adjustments to instruction scheduling in the main `tt.dot` loop to achieve higher performance. Such adjustments require detailed information about the number of issued `v_mfma`, `ds_read`, `ds_write`, and `global_load` instructions. This PR extends the Triton AMDGPU backend by adding instruction counting during `TritonAMDGPUToLLVM` pass execution.

An example of instruction counting and instruction scheduling is demonstrated in the `createCKV3Schedule` method, which implements CK's V3 software pipelining.
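As a rough illustration of what the counting does, the sketch below is a minimal Python model, not the actual MLIR C++ implementation; the mnemonics and the dictionary shape are illustrative assumptions. During lowering, emitted instructions are bucketed into the categories the scheduling hints care about, and the totals are attached to a hint:

```python
# Illustrative model of per-category instruction counting (hypothetical
# names; the real pass counts ops while lowering to LLVM dialect).
from collections import Counter

CATEGORIES = ("v_mfma", "ds_read", "ds_write", "global_load")

def count_instructions(lowered_ops):
    """Bucket lowered instruction mnemonics into the categories the
    scheduling hints need; everything else is ignored."""
    counts = Counter()
    for op in lowered_ops:
        for cat in CATEGORIES:
            if op.startswith(cat):
                counts[cat] += 1
    return {cat: counts[cat] for cat in CATEGORIES}

# A made-up main-loop body after lowering:
loop_body = ["global_load_dwordx4", "ds_write_b128", "ds_read_b128",
             "ds_read_b128", "v_mfma_f32_32x32x8f16", "v_mfma_f32_32x32x8f16"]
hint = count_instructions(loop_body)
print(hint)  # {'v_mfma': 2, 'ds_read': 2, 'ds_write': 1, 'global_load': 1}
```

A schedule builder (such as `createCKV3Schedule`) can then consume these totals to decide how to interleave memory and MFMA instructions.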

Collaborator

@antiagainst antiagainst left a comment


I think overall this looks fine, but there are quite a few places we can simplify. We also need documentation and testing.

lib/Conversion/TritonGPUToLLVM/Utility.cpp (resolved)
lib/Conversion/TritonGPUToLLVM/MemoryOpToLLVM.cpp (resolved)
lib/Conversion/TritonGPUToLLVM/MemoryOpToLLVM.cpp (resolved)
@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch 7 times, most recently from aece96b to 06210a4 on October 1, 2024 15:32

op->getBlock()->walk([&](amdgpu::InstructionSchedHint schedHint) {
schedHint.setNumMMAsAttr(counterAttr);
});
Collaborator


I'm wondering if this works when there are multiple tt.dot ops in the loop?

Contributor Author


Hi @zhanglx13,

No, it is not going to work. Supporting multiple `tt.dot` ops would require further investigation and extensions.

Collaborator


Do you plan to generalize the design to support multiple tt.dot ops?
I'm asking because the pipelineV3 or CKV3 pipeline will prefetch the whole LDS buffer, whereas the prefetchLDS pass can prefetch a partial LDS buffer. However, the prefetchLDS pass will lead to multiple tt.dot ops in the loop, each of which corresponds to one prefetched LDS sub-buffer.
The prefetchLDS pass will also need some sched_group_barrier tweaks to "move things around".
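For intuition, the kind of schedule that sched_group_barrier-style tweaks express can be modeled as alternating small groups of memory and MFMA instructions so the two overlap. The sketch below is a hypothetical Python model; the group size and counts are made up for illustration and are not taken from the actual pass:

```python
# Hypothetical model of an interleaved schedule: given per-loop instruction
# counts, emit alternating groups of ds_read and v_mfma instructions.
def interleave(num_ds_reads, num_mfmas, group_size=2):
    """Return a list of (category, count) groups, round-robining between
    the two categories until both counts are exhausted."""
    schedule = []
    reads, mfmas = num_ds_reads, num_mfmas
    while reads > 0 or mfmas > 0:
        take = min(group_size, reads)
        if take:
            schedule.append(("ds_read", take))
            reads -= take
        take = min(group_size, mfmas)
        if take:
            schedule.append(("v_mfma", take))
            mfmas -= take
    return schedule

print(interleave(4, 8))
# [('ds_read', 2), ('v_mfma', 2), ('ds_read', 2), ('v_mfma', 2),
#  ('v_mfma', 2), ('v_mfma', 2)]
```

This is why accurate per-category counts matter: the schedule builder needs to know how many instructions of each kind exist before it can partition them into groups.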

Collaborator


Yeah, I feel we may need more targeted instruction counting. The hint op is basically carrying side-channel information for the tt.dot; we can have one hint op immediately before/after a tt.dot for that tt.dot. It's a bit fragile but fine if we insert it at the proper time. Then we may need to build different schedules for different tt.dot ops (e.g., in the main loop vs. in the epilogue), and the instruction counting needs to be more clever to figure out the different "segments".
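The "segments" idea above can be sketched as follows: instead of one global count per block, attribute counts to the nearest preceding hint marker. This is an illustrative Python model with a hypothetical `"HINT"` marker standing in for the hint op; it is not the actual implementation:

```python
# Illustrative model of segmented instruction counting: each "HINT"
# marker opens a new counting segment (one hypothetical marker per
# tt.dot); instructions are counted against the current segment.
def count_per_segment(stream):
    """stream: list of instruction mnemonics, with "HINT" markers.
    Returns one {mnemonic: count} dict per segment."""
    segments = []
    current = None
    for instr in stream:
        if instr == "HINT":
            current = {}
            segments.append(current)
        elif current is not None:
            current[instr] = current.get(instr, 0) + 1
    return segments

stream = ["HINT", "ds_read", "v_mfma",
          "HINT", "ds_read", "ds_read", "v_mfma"]
print(count_per_segment(stream))
# [{'ds_read': 1, 'v_mfma': 1}, {'ds_read': 2, 'v_mfma': 1}]
```

With per-segment counts, each tt.dot could then get its own schedule (main loop vs. epilogue) built from the counts of its own segment only.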

Collaborator

@antiagainst antiagainst left a comment


Cool! The implementation looks better now. The major missing pieces are still documentation and testing.

let arguments = (ins
I32Attr:$numDsReadsTileA,
I32Attr:$numDsReadsTileB,
I32Attr:$numDsWritesTileA,
Collaborator


I see, thanks! You might want to put the link directly in the comment so it's easy to associate? (Right now what you have there is not a permalink.)

https://github.com/ROCm/composable_kernel/blob/de3e3b642402eac5b4a466f6a2fa5e9f022ba680/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_v3.hpp#L160-L263



@antiagainst antiagainst changed the title [AMD] instruction counting during TritonAMDGPUToLLVM pass [AMD] Count llvm instruction during conversion for scheduling hints Oct 3, 2024
@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch 9 times, most recently from d861c01 to cf97e35 on October 4, 2024 14:18
@ravil-mobile ravil-mobile marked this pull request as ready for review October 4, 2024 16:46
Collaborator

@antiagainst antiagainst left a comment


Implementation looks good now. Just need to add tests next:

  • Op tests for the new hint op
  • Conversion tests for the pass
  • etc.

@antiagainst
Collaborator

BTW, @ravil-mobile, when you address comments, please use separate commits; don't squash everything into one commit--otherwise reviewers are required to reread everything. Separate commits allow us to read only the delta easily. Also, prefer git merge origin/main over force-pushing--it also helps speed up code reviews. Thanks! :)

@ravil-mobile
Contributor Author


Agree, makes sense.

@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch 2 times, most recently from bf3443b to 32c91ac on October 7, 2024 16:15
Collaborator

@antiagainst antiagainst left a comment


Can you also fix the failing tests?

test/TritonGPU/amd/amd-instruction-sched.mlir (resolved)
test/TritonGPU/amd/amd-instruction-sched.mlir (resolved)
test/TritonGPU/amd/amd-instruction-sched.mlir (resolved)
test/TritonGPU/amd/amd-instruction-sched.mlir (resolved)
@ravil-mobile ravil-mobile force-pushed the ravil/sched-barriers-stat branch 6 times, most recently from a06f1a7 to cbbc694 on October 10, 2024 15:09
@antiagainst antiagainst merged commit e87f877 into triton-lang:main Oct 13, 2024
7 checks passed
ptillet added a commit that referenced this pull request Oct 16, 2024
ptillet added a commit that referenced this pull request Oct 16, 2024
antiagainst pushed a commit that referenced this pull request Oct 31, 2024
This commit relands #4819
with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced the template-based `rewindUnaryOps` with regular
  for-loops. The new approach is more robust and handles other
  unary ops automatically.
* Replaced `instr.sched.barriers` with the equivalents from the
  upstream MLIR `rocdl` dialect
* Extended lit tests
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
…riton-lang#4819)

Advanced software pipelining may require fine-grained adjustments
to instruction scheduling in the main `tt.dot` loop to achieve
higher performance. Such adjustments require detailed information
about the number of issued `v_mfma`, `ds_read`, `ds_write`, and
`global_load` instructions. This PR extends the Triton AMDGPU backend
by adding instruction counting during `TritonAMDGPUToLLVM` pass
execution.

An example of instruction counting and instruction scheduling is
demonstrated in the `createCKV3Schedule` method which implements the
[CK's V3 software
pipelining](https://github.com/ROCm/composable_kernel/blob/de3e3b642402eac5b4a466f6a2fa5e9f022ba680/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_v3.hpp#L160-L263).

This change is experimental for better GEMM performance. The design
is not final and may be subject to change in the future.
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024
guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024