[AMD] Reland instruction scheduling hint changes #4940

ravil-mobile · 2024-10-17T09:55:49Z

This commit relands #4819 with the following fixes:

* Replaced the template-based `rewindUnaryOps` with regular for-loops. The new implementation is more robust and can handle other unary ops automatically.
* Replaced `instr.sched.barriers` with the ones from the `rocdl` dialect in upstream MLIR.
* Extended the lit tests.
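
To make the first fix concrete, here is a minimal, self-contained sketch of rewinding through a chain of unary ops with a plain loop. It is not the code from `SchedInstructions.cpp`: the `Op` struct, the function signature, and the op names are toy stand-ins for MLIR operations, assumed here purely for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for an MLIR operation (hypothetical type, for illustration
// only): a name plus the ops that define its operands.
struct Op {
  std::string name;
  std::vector<const Op *> operands;
};

// Walks backwards from `start` through ops that have exactly one operand
// until it reaches an op named `target`. Returns nullptr if the chain ends
// or hits a non-unary op first.
const Op *rewindUnaryOps(const Op *start, const std::string &target) {
  for (const Op *op = start; op != nullptr; op = op->operands[0]) {
    if (op->name == target)
      return op;
    if (op->operands.size() != 1)
      return nullptr; // not a unary op: stop rewinding
  }
  return nullptr;
}

// Usage: any unary op between the starting point and the load is skipped
// without having to be listed explicitly.
void demoRewind() {
  Op load{"load", {}};
  Op convert{"convert_layout", {&load}};
  Op trans{"trans", {&convert}};
  assert(rewindUnaryOps(&trans, "load") == &load);

  Op add{"add", {&load, &load}}; // two operands, so not unary
  assert(rewindUnaryOps(&add, "load") == nullptr);
}
```

Because the loop only checks the operand count, a new unary op in the chain needs no extra case, which is presumably what "can handle other unary ops automatically" refers to.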

third_party/amd/lib/TritonAMDGPUToLLVM/SchedInstructions.cpp

third_party/amd/lib/TritonAMDGPUTransforms/StreamPipelineV2.cpp

antiagainst

Thanks @ravil-mobile! A few comments.

lib/Conversion/TritonGPUToLLVM/Utility.cpp

third_party/amd/backend/compiler.py

third_party/amd/include/Dialect/TritonAMDGPU/IR/TritonAMDGPUOps.td

third_party/amd/lib/TritonAMDGPUTransforms/StreamPipelineV2.cpp

third_party/amd/lib/TritonAMDGPUToLLVM/SchedInstructions.cpp

zhanglx13 · 2024-10-30T03:33:31Z

third_party/amd/lib/TritonAMDGPUToLLVM/SchedInstructions.cpp

+    mod.walk([this, ctx](scf::ForOp forOp) {
+      // Note, instruction schedule barriers are inserted only in the case of
+      // a single `tt.dot` op in a `scf::ForOp` scope in the current
+      // implementation.


Does it mean that for attention-like kernels, i.e. chained dots in the main loop, we will not have `sched.barrier` inserted?
May I ask why this is the case?
I'm asking because I see better register pressure for flash-attention kernels if `sched.barrier` is inserted (yes, two `tt.dot` ops means we have two `sched.barrier` ops at the beginning and end of the loop).

Hi @zhanglx13,

> Does it mean that for attention-like kernels, i.e. chained dots in the main loop, we will not have sched.barrier inserted?
> May I ask why this is the case?

Yes, you are correct. That is the state of the current implementation. There is still some work that needs to be done even for a single tt.DotOp per block. The FA kernel is one of the next steps. I would suggest working iteratively. I am sure there are going to be some challenges with instruction scheduling for FA-like kernels.
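
The restriction discussed in this thread can be pictured with a small, self-contained sketch. It models a loop body as a list of op names and applies the "exactly one `tt.dot`" rule from the code comment. `LoopBody` and `shouldInsertSchedBarriers` are hypothetical names, not part of the actual pass, which walks `scf::ForOp` regions in MLIR.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the body of an `scf::ForOp`: just a list of op names.
using LoopBody = std::vector<std::string>;

// Mirrors the rule quoted in the review comment: barriers are inserted only
// when the loop body contains exactly one `tt.dot`.
bool shouldInsertSchedBarriers(const LoopBody &body) {
  return std::count(body.begin(), body.end(), std::string("tt.dot")) == 1;
}

// Usage: a GEMM-like loop qualifies, a chained-dot (attention-like) loop
// does not.
void demoSingleDot() {
  LoopBody gemm = {"tt.load", "tt.load", "tt.dot"};
  LoopBody attention = {"tt.load", "tt.dot", "tt.load", "tt.dot"};
  assert(shouldInsertSchedBarriers(gemm));
  assert(!shouldInsertSchedBarriers(attention));
}
```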

zhanglx13 · 2024-10-30T03:37:53Z

I'm also curious if the current instruction count mechanism can handle the case when the backend decides to combine two ds_read_b64 into one ds_read2st_b64?

ravil-mobile · 2024-10-30T09:20:51Z

> I'm also curious if the current instruction count mechanism can handle the case when the backend decides to combine two ds_read_b64 into one ds_read2st_b64?

Hi @zhanglx13, no, the current implementation cannot handle it. Instruction folding happens during instruction selection in our compiler backend. We have no explicit control over it from the MLIR level. This issue should be addressed in one of the follow-up tickets.
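
A toy sketch of why such folding matters for a count-based scheme: if the number of LDS reads is estimated before the backend runs, merging pairs of reads afterwards halves the real count. Everything here is an assumption for illustration. The instruction names are copied from the comment above, `mfma` is a placeholder, and the folding rule (merge any two adjacent reads) is far simpler than what a real backend does, where addresses and offsets decide whether a merge is legal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of backend folding: every two adjacent `ds_read_b64` entries
// are merged into a single `ds_read2st_b64`.
std::vector<std::string> foldLdsReads(const std::vector<std::string> &instrs) {
  std::vector<std::string> out;
  for (std::size_t i = 0; i < instrs.size(); ++i) {
    const bool canFold = instrs[i] == "ds_read_b64" &&
                         i + 1 < instrs.size() &&
                         instrs[i + 1] == "ds_read_b64";
    if (canFold) {
      out.push_back("ds_read2st_b64");
      ++i; // skip the second read of the pair
    } else {
      out.push_back(instrs[i]);
    }
  }
  return out;
}

// Counts instructions whose name starts with "ds_read".
std::size_t countLdsReads(const std::vector<std::string> &instrs) {
  std::size_t n = 0;
  for (const auto &name : instrs)
    if (name.rfind("ds_read", 0) == 0)
      ++n;
  return n;
}

// Usage: the count assumed before folding no longer matches afterwards.
void demoCountMismatch() {
  const std::vector<std::string> expected = {
      "ds_read_b64", "ds_read_b64", "ds_read_b64", "ds_read_b64", "mfma"};
  assert(countLdsReads(expected) == 4);
  assert(countLdsReads(foldLdsReads(expected)) == 2);
}
```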

giuseros

LGTM

This commit relands triton-lang#4819 with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced template-based `rewindUnaryOps` to use regular for-loops. The new way is more robust and can handle other unary ops automatically.
* Replaced `instr.sched.barriers` using the ones from `rocdl` dialect from the MLIR upstream
* Extended lit tests

ravil-mobile force-pushed the ravil/bug-fix branch from 48cc87a to dfcb55e on October 17, 2024 09:56

ravil-mobile changed the title ~~Ravil/bug fix~~ [AMD] Fixed a bug resulted in reverting PR#4919 Oct 17, 2024

antiagainst mentioned this pull request Oct 17, 2024

[AMD] Use rocdl instr.sched.barriers from upstream MLIR/ROCDL #4939

Closed

7 tasks

antiagainst changed the title ~~[AMD] Fixed a bug resulted in reverting PR#4919~~ [AMD] Reland instruction scheduling hint changes Oct 17, 2024

ravil-mobile force-pushed the ravil/bug-fix branch 2 times, most recently from 5b044e9 to 73b15e8 on October 18, 2024 09:57

ravil-mobile marked this pull request as ready for review October 18, 2024 09:58

ravil-mobile requested review from antiagainst, zhanglx13 and ptillet as code owners October 18, 2024 09:58

giuseros reviewed Oct 18, 2024

third_party/amd/lib/TritonAMDGPUToLLVM/SchedInstructions.cpp (outdated, resolved)

ravil-mobile force-pushed the ravil/bug-fix branch 2 times, most recently from 4cb27d1 to 00ab1fe on October 22, 2024 10:59

giuseros reviewed Oct 22, 2024

third_party/amd/lib/TritonAMDGPUTransforms/StreamPipelineV2.cpp (outdated, resolved)

ravil-mobile force-pushed the ravil/bug-fix branch from 00ab1fe to 97f0e1e on October 22, 2024 12:54

ravil-mobile requested a review from giuseros October 22, 2024 13:17

ravil-mobile force-pushed the ravil/bug-fix branch 6 times, most recently from 875275b to b46f772 on October 24, 2024 16:01

antiagainst requested changes Oct 25, 2024

ravil-mobile force-pushed the ravil/bug-fix branch 3 times, most recently from e622e4f to 18ebc7a on October 28, 2024 16:22

ravil-mobile requested a review from antiagainst October 28, 2024 16:24

ravil-mobile force-pushed the ravil/bug-fix branch 2 times, most recently from 354fd36 to 343603e on October 29, 2024 10:37

Revert "[AMD] revert optimizations (triton-lang#4919)"

4b50c48

This reverts commit 93de426.

ravil-mobile added 9 commits October 29, 2024 13:17

[AMD] use rocdl instr.sched.barriers from upstream MLIR/ROCDL

d53c499

[AMD] fixed a bug resulted in reverting PR#4919

ce02968

Replaced template-based impl. of `rewindUnaryOps` in `SchedInstructions.cpp` using regular for-loops. The new impl. is more robust and can handle other unary ops automatically.

[AMD] Moved annotateDotUsageOnLoadStore to stream pipeliner

2aecafa

[AMD] Fixed bug in setNumGeneratedGlobalLoads

088fbd9

* add a test for the presence of OpIdx attribute

[AMD] added additional check into createCKV3Schedule

c72bfb9

The extra check tests whether the data are loaded from HBM using `buffer_load` instructions. The CKV3 scheduling is skipped if the check fails.

[AMD] Udated tests for SchedInstructions passes

68e7fac

[AMD] Addressed comments of PR#4940

dd7d2c6

[AMD] Fixed propagation of OpIdx attribute in LoadToBufferLoad pass

a2f8874

[AMD] Fixed instruction.sched lit tests

ae8c3c8

ravil-mobile force-pushed the ravil/bug-fix branch from 343603e to ae8c3c8 on October 29, 2024 13:19

Drop unnecessary mlir::prefix and early return if none choice

2504666

antiagainst approved these changes Oct 29, 2024

zhanglx13 reviewed Oct 30, 2024

giuseros approved these changes Oct 31, 2024

Merge branch 'main' into ravil/bug-fix

63a552f

antiagainst merged commit ee5876c into triton-lang:main Oct 31, 2024
7 checks passed
