[AMD] Reland instruction scheduling hint changes #4940
Thanks @ravil-mobile! A few comments.
This reverts commit 93de426.
Replaced the template-based implementation of `rewindUnaryOps` in `SchedInstructions.cpp` with regular for-loops. The new implementation is more robust and can handle other unary ops automatically.
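To illustrate the loop-based idea in the commit message above, here is a minimal sketch in plain C++. It is not the code from `SchedInstructions.cpp`: `ToyOp` and `rewindUnaryOpsSketch` are hypothetical stand-ins for the MLIR types and the real function, and the sketch assumes that "unary" simply means "has exactly one operand".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy model of an IR operation; NOT mlir::Operation.
struct ToyOp {
  std::string name;
  std::vector<const ToyOp *> operands; // defining ops of each operand
};

// Walk up the def chain from `start`, stepping through ops that have
// exactly one operand, until an op named `target` is reached.
// Returns nullptr if the chain ends or hits a non-unary op first.
const ToyOp *rewindUnaryOpsSketch(const ToyOp *start,
                                  const std::string &target) {
  for (const ToyOp *cur = start; cur != nullptr;
       cur = cur->operands.front()) {
    if (cur->name == target)
      return cur;
    if (cur->operands.size() != 1)
      return nullptr; // not unary: stop rewinding
  }
  return nullptr;
}
```

Because the loop only checks the operand count, any single-operand op is skipped without listing it explicitly, which is one plausible reading of "can handle other unary ops automatically".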
* add a test for the presence of OpIdx attribute
The extra check tests whether the data are loaded from HBM using `buffer_load` instructions. The CKV3 scheduling is skipped if the check fails.
mod.walk([this, ctx](scf::ForOp forOp) {
  // Note, instruction schedule barriers are inserted only in the case of
  // a single `tt.dot` op in a `scf::ForOp` scope in the current
  // implementation.
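The restriction described in the comment above can be pictured with a small self-contained sketch. The names `ToyLoop` and `shouldInsertSchedBarriers` are hypothetical and do not appear in the PR; the sketch only models "exactly one `tt.dot` in the loop body" as a count over op names and says nothing about how the real pass walks the `scf::ForOp`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy model of a loop body as a flat list of op names.
struct ToyLoop {
  std::vector<std::string> bodyOps;
};

// Returns true only when the loop body contains exactly one `tt.dot`,
// mirroring the single-dot restriction stated in the code comment.
bool shouldInsertSchedBarriers(const ToyLoop &loop) {
  const auto numDots = std::count(loop.bodyOps.begin(), loop.bodyOps.end(),
                                  std::string("tt.dot"));
  return numDots == 1;
}
```

Under this model a GEMM-like loop with one `tt.dot` qualifies, while a loop with two chained dots, as in attention-like kernels, is skipped.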
Does this mean that for attention-like kernels, i.e. chained dots in the main loop, we will not have `sched.barrier` inserted?
May I ask why this is the case?
I'm asking because I see better register pressure for flash-attention kernels if `sched.barrier` is inserted (yes, two `tt.dot` ops mean we have two `sched.barrier` ops, at the beginning and end of the loop).
Hi @zhanglx13,
> Does it mean for attention like kernels, i.e. chained dot in the main loop, we will not have sched.barrier inserted?
> May I ask why this is the case?
Yes, you are correct. That is the state of the current implementation. There is still some work that needs to be done even for a single `tt.DotOp` per block. The FA kernel is one of the next steps. I would suggest working iteratively. I am sure that there are going to be some challenges with instruction scheduling for the FA-like kernels.
> I'm also curious if the current instruction count mechanism can handle the case when the backend decides to combine two

Hi @zhanglx13, no, the current implementation cannot handle it. Instruction folding happens during instruction selection in our compiler backend. We have no explicit control over it from the MLIR level. This issue should be addressed in one of the follow-up tickets.
LGTM
This commit relands triton-lang#4819 with the following fixes:
* Changed to a better way to mark opIdx for loads
* Replaced the template-based `rewindUnaryOps` with regular for-loops. The new way is more robust and can handle other unary ops automatically.
* Replaced `instr.sched.barriers` with the ones from the `rocdl` dialect in upstream MLIR
* Extended lit tests