Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMD] Reland instruction scheduling hint changes #4940

Merged
merged 12 commits into from
Oct 31, 2024

Conversation

ravil-mobile
Copy link
Contributor

@ravil-mobile ravil-mobile commented Oct 17, 2024

This commit relands #4819
with the following fixes:

  • Replaced temlate-based rewindUnaryOps to use regular for-loops. The new way is more robust and can handle other unary ops automatically.
  • Replaced instr.sched.barriers using the ones from rocdl dialect from the MLIR upstream
  • Extended lit tests

@ravil-mobile ravil-mobile changed the title Ravil/bug fix [AMD] Fixed a bug resulted in reverting PR#4919 Oct 17, 2024
@antiagainst antiagainst changed the title [AMD] Fixed a bug resulted in reverting PR#4919 [AMD] Reland instruction scheduling hint changes Oct 17, 2024
@ravil-mobile ravil-mobile force-pushed the ravil/bug-fix branch 2 times, most recently from 5b044e9 to 73b15e8 Compare October 18, 2024 09:57
@ravil-mobile ravil-mobile marked this pull request as ready for review October 18, 2024 09:58
@ravil-mobile ravil-mobile force-pushed the ravil/bug-fix branch 2 times, most recently from 4cb27d1 to 00ab1fe Compare October 22, 2024 10:59
Copy link
Collaborator

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ravil-mobile! A few comments.

lib/Conversion/TritonGPUToLLVM/Utility.cpp Show resolved Hide resolved
third_party/amd/backend/compiler.py Show resolved Hide resolved
@ravil-mobile ravil-mobile force-pushed the ravil/bug-fix branch 3 times, most recently from e622e4f to 18ebc7a Compare October 28, 2024 16:22
@ravil-mobile ravil-mobile force-pushed the ravil/bug-fix branch 2 times, most recently from 354fd36 to 343603e Compare October 29, 2024 10:37
Replaced temlate-based impl. of `rewindUnaryOps` in
`SchedInstructions.cpp` using regular for-loops.
The new impl. is more robust and can handle
other unary ops automatically.
* add a test for the presence of OpIdx attribute
The extra check tests whether the data are loaded from HBM
using `buffer_load` instructions. The CKV3 scheduling is
skipped if the check fails.
mod.walk([this, ctx](scf::ForOp forOp) {
// Note, instruction schedule barriers are inserted only in the case of
// a single `tt.dot` op in a `scf::ForOp` scope in the current
// implementation.
Copy link
Collaborator

@zhanglx13 zhanglx13 Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean for attention like kernels, i.e. chained dot in the main loop, we will not have sched.barrier inserted?
May I ask why this is the case?
I'm asking because I see better register pressure if sched.barrier is inserted (yes, two tt.dot means we have two sched.barrier at beginning and end of the loop) for flash-attention kernels.

Copy link
Contributor Author

@ravil-mobile ravil-mobile Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @zhanglx13,

Does it mean for attention like kernels, i.e. chained dot in the main loop, we will not have sched.barrier inserted?
May I ask why this is the case?

Yes, you are correct. It is the state of the current implementation. There is still some work which need to be done even for a single tt.DotOp per block. The FA kernel is one of the next steps. I would suggest to work iteratively. I am sure that there are going to be some challenges with instruction scheduling for the FA-like kernels

@zhanglx13
Copy link
Collaborator

I'm also curious if the current instruction count mechanism can handle the case when the backend decides to combine two ds_read_b64 into one ds_read2st_b64?

@ravil-mobile
Copy link
Contributor Author

I'm also curious if the current instruction count mechanism can handle the case when the backend decides to combine two ds_read_b64 into one ds_read2st_b64?

Hi @zhanglx13, no the current implementation cannot handle it. Instruction folding happens during instruction selection at the level of our compiler-backend. We have no explicit control of it from the MLIR level. This issue should be addressed in one of the follow-up tickets.

Copy link
Contributor

@giuseros giuseros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antiagainst antiagainst merged commit ee5876c into triton-lang:main Oct 31, 2024
7 checks passed
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
This commit relands triton-lang#4819
with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced temlate-based `rewindUnaryOps` to use regular
  for-loops. The new way is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` using the ones from
  `rocdl` dialect from the MLIR upstream
* Extended lit tests
guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024
This commit relands triton-lang#4819
with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced temlate-based `rewindUnaryOps` to use regular
  for-loops. The new way is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` using the ones from
  `rocdl` dialect from the MLIR upstream
* Extended lit tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants