Use select for masked single element load #1770

Draft · wants to merge 4 commits into base: main

Conversation

alexbaden
Contributor

As part of my investigation into #1721, I found an opportunity to remove some branches in certain kernels by replacing a predicated load with a load + select. The NVIDIA path does something similar; the difference there is that it can emit a masked load instruction. This patch removes quite a bit of branching from the sample kernel in #1721, reducing runtime from 0.090ms to 0.067ms.
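
Sketched in LLVM IR terms (the function names and types here are illustrative, not the actual generated kernel), the change swaps the branch-and-phi shape for an unconditional load whose result is selected against the `other` value:

```llvm
; Before: the load is guarded by a branch on the mask bit, which creates
; divergent control flow in the kernel.
define float @masked_load_branch(ptr addrspace(1) %ptr, i1 %mask, float %other) {
entry:
  br i1 %mask, label %do_load, label %done

do_load:
  %val = load float, ptr addrspace(1) %ptr, align 4
  br label %done

done:
  %res = phi float [ %val, %do_load ], [ %other, %entry ]
  ret float %res
}

; After: the load is issued unconditionally and the mask only picks the result,
; so no branch is emitted.
define float @masked_load_select(ptr addrspace(1) %ptr, i1 %mask, float %other) {
entry:
  %val = load float, ptr addrspace(1) %ptr, align 4
  %res = select i1 %mask, float %val, float %other
  ret float %res
}
```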

I intend to do more testing as well as look for other optimizations in similar kernels, so I am marking this as a draft for now, but putting it up so it can run through CI (my local env is not set up for testing against IPEX).

@chengjunlu
Contributor

@alexbaden
If IGC cannot optimize away the branching that comes from predication, then a load with a mask can be good for performance.

Please use the GenISA intrinsic instead of loading the memory without protection. The unprotected load may cause a page fault on the GPU if the address is not valid.

Here is the gather load with mask in GenISA.
https://github.com/intel/intel-graphics-compiler/blob/0d1c68f522be3684fa2a671f042bf5794dda1849/IGC/GenISAIntrinsics/generator/input/Intrinsic_definitions.yml#L9055

Here is the scatter store with the mask in GenISA.
https://github.com/intel/intel-graphics-compiler/blob/0d1c68f522be3684fa2a671f042bf5794dda1849/IGC/GenISAIntrinsics/generator/input/Intrinsic_definitions.yml#L9234
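
For reference, the behavior being asked for (the load suppressed for masked-off lanes so no page fault can occur, with the other value passed through) is what a masked-load intrinsic provides. Below is a minimal sketch using LLVM's generic llvm.masked.load intrinsic purely as an analogy for the GenISA gather linked above; the actual GenISA intrinsic name and signature are the ones defined in the linked YAML, not what is shown here.

```llvm
declare <1 x float> @llvm.masked.load.v1f32.p1(ptr addrspace(1), i32, <1 x i1>, <1 x float>)

define float @masked_load_intrinsic(ptr addrspace(1) %ptr, i1 %mask, float %other) {
entry:
  ; Splat the scalar mask and pass-through value into single-element vectors,
  ; since the generic intrinsic operates on vectors.
  %mask.v  = insertelement <1 x i1> poison, i1 %mask, i32 0
  %other.v = insertelement <1 x float> poison, float %other, i32 0
  ; Masked-off lanes are never accessed; they yield %other instead.
  %val.v = call <1 x float> @llvm.masked.load.v1f32.p1(ptr addrspace(1) %ptr, i32 4, <1 x i1> %mask.v, <1 x float> %other.v)
  %val = extractelement <1 x float> %val.v, i32 0
  ret float %val
}
```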

@whitneywhtsang
Contributor

Had a discussion with the IGC team; the next step is for us to evaluate the performance impact with the GenISA intrinsic, and to open an IGC issue to have it exposed if it gives a performance gain.

@etiotto
Contributor

etiotto commented Aug 6, 2024

@alexbaden are you going to measure the performance impact of using the proposed IGC GenISA functions for that workload?

@alexbaden
Contributor Author

We tried the IGC function, but it is meant for masked loads on the render engine, not for GPGPU.
Using select on the other constant / loaded value seems to work in local testing, but we are not yet sure what the implications are of issuing a load instruction on what may be a null ptr / uninitialized memory, even if the result of that load is not used.
I did try selecting between the ptr values (putting other into its own alloca), but the ptr has to be in global memory, so this ended up being quite slow.
The current PR calls select on the value, not the ptr. Let's see if it passes CI and then decide how to proceed.
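
For contrast with the value-select shape sketched earlier, this is roughly what the pointer-select variant that was tried looks like. The names are illustrative; the key constraint described above is that the slot holding the other value must be addressable in the same global address space as the real pointer, which is what made this variant slow.

```llvm
define float @select_on_pointer(ptr addrspace(1) %ptr, i1 %mask,
                                ptr addrspace(1) %other_slot) {
entry:
  ; Masked-off lanes read from %other_slot, which holds the "other" value and
  ; is assumed to already live in global memory alongside %ptr.
  %safe_ptr = select i1 %mask, ptr addrspace(1) %ptr, ptr addrspace(1) %other_slot
  %res = load float, ptr addrspace(1) %safe_ptr, align 4
  ret float %res
}
```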


Successfully merging this pull request may close these issues.

No perf advantage for torch.compile on examples from pytorch tutorial