
[Pytorch Upstream] Intel Triton is not ready to return n_regs for a compiled_binary. #1641

Open
etaf opened this issue Jul 17, 2024 · 9 comments · May be fixed by #2391

etaf commented Jul 17, 2024

Hi, we recently set out to follow the CUDA/ROCm design and enable support for Inductor's dynamic_rblock_scaling, but for that we need to get the number of registers used by a compiled kernel, as done here:
https://github.com/pytorch/pytorch/blob/7c45476d38176c8d5b19fb379fc073dc21beba64/torch/_inductor/runtime/triton_heuristics.py#L273-L296

But currently Intel Triton always returns n_regs = 0.
Can we return n_regs the way CUDA does?

Reference: ROCm is implementing dynamic_rblock_scaling like CUDA: pytorch/pytorch#129663
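For context, the heuristic's core is an occupancy estimate computed from the compiled binary's register count. A minimal sketch of that arithmetic (simplified from the linked triton_heuristics.py; the function name and device constants below are illustrative, not actual Inductor code):

```c
/* Sketch of the occupancy estimate behind dynamic_rblock_scaling
 * (simplified from the linked triton_heuristics.py; names and the
 * device constants are illustrative, not the actual Inductor code). */
#include <stdio.h>

/* Estimate how many blocks of a compiled kernel can be resident on one SM,
 * given the per-thread register count reported by the compiled binary. */
static int max_blocks_per_sm(int n_regs, int num_warps,
                             int regs_per_multiprocessor) {
    const int threads_per_warp = 32;
    int regs_per_block = n_regs * threads_per_warp * num_warps;
    if (regs_per_block == 0)
        return -1; /* n_regs == 0 (Intel Triton today) makes the estimate
                      meaningless, so the heuristic cannot run. */
    return regs_per_multiprocessor / regs_per_block;
}

int main(void) {
    /* Example: 64 regs/thread, 8 warps, 65536 regs per SM (A100-like). */
    printf("resident blocks: %d\n", max_blocks_per_sm(64, 8, 65536));
    /* When the estimate is too low, Inductor shrinks RBLOCK and retries. */
    return 0;
}
```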

alexbaden self-assigned this Jul 19, 2024
alexbaden (Contributor) commented:

I attempted to use https://spec.oneapi.io/level-zero/latest/tools/api.html#zetkernelgetprofileinfo to retrieve register use when compiling the kernel, but I have not had success getting zeModuleCreate to respect the profile flag (the register count always comes back 0). I also looked at the SYCL APIs for querying registers and was not successful there either (I don't think privateMemory is what we need, but I could not get that to work anyway). However, IGC is clearly able to count allocated registers, since that information shows up in debug logs when compilation fails due to register spills. We likely need to raise this with the IGC team and/or the Level Zero team to see if there is a known configuration that exposes a kernel's allocated registers, or whether we need to file a feature request.
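For reference, the attempted query looks roughly like this (a sketch, not the exact code; it assumes a valid kernel handle, and the structure fields follow the Level Zero tools spec linked above):

```c
/* Sketch of the attempted zetKernelGetProfileInfo query (assumes a valid
 * kernel handle; field names follow the Level Zero tools spec). */
#include <level_zero/zet_api.h>
#include <stdio.h>

void print_profile_info(zet_kernel_handle_t kernel) {
    zet_profile_properties_t props = {0};
    props.stype = ZET_STRUCTURE_TYPE_PROFILE_PROPERTIES;
    /* Note: the module must be built with profiling enabled (a
     * zeModuleCreate build flag) for this query to return useful data. */
    if (zetKernelGetProfileInfo(kernel, &props) == ZE_RESULT_SUCCESS) {
        /* The structure exposes profile flags and a token count,
         * but no per-thread register count. */
        printf("flags: %u, numTokens: %u\n",
               (unsigned)props.flags, (unsigned)props.numTokens);
    }
}
```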

etaf (Author) commented Jul 30, 2024

@EikanWang maybe we should file a feature request with the IGC team and/or the Level Zero team for this?

etaf (Author) commented Aug 6, 2024

Hi @vlad-penkin, we need this feature for performance optimization. Will the Triton team raise this request with the IGC team?

vlad-penkin (Contributor) commented:

> Hi @vlad-penkin, we need this feature for performance optimization. Will the Triton team raise this request with the IGC team?

Yes.

etiotto (Contributor) commented Aug 8, 2024

Conceptually this is similar to what Triton (for the XPU target) already does to get the spill size from IGC. The code for that (see driver.c in the intel third_party directory) is:

[screenshot: the spill-size query in driver.c]

We use L0 APIs to retrieve the ze_kernel_properties_t structure. That structure's declaration is:

[screenshot: the ze_kernel_properties_t declaration from the Level Zero spec]

So it doesn't contain a field for the number of registers used by the kernel. We need the L0 API team to add an n_regs field to that structure. Once the L0 API is updated, the work in Triton to expose n_regs to users (e.g. PyTorch) is pretty straightforward.
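A minimal sketch of that query against the current Level Zero core API (the structure fields are from the spec; note the absence of any register-count field):

```c
/* Sketch of the existing spill-size query via the Level Zero core API
 * (assumes a valid kernel handle). ze_kernel_properties_t, per the current
 * spec, reports local/private/spill memory sizes but has no field for the
 * number of registers the kernel uses. */
#include <level_zero/ze_api.h>
#include <stdio.h>

void print_kernel_mem_props(ze_kernel_handle_t kernel) {
    ze_kernel_properties_t props = {0};
    props.stype = ZE_STRUCTURE_TYPE_KERNEL_PROPERTIES;
    if (zeKernelGetProperties(kernel, &props) == ZE_RESULT_SUCCESS) {
        printf("localMemSize:   %u bytes\n", (unsigned)props.localMemSize);
        printf("privateMemSize: %u bytes\n", (unsigned)props.privateMemSize);
        printf("spillMemSize:   %u bytes\n", (unsigned)props.spillMemSize);
        /* A hypothetical props.n_regs would first have to be added to
         * the Level Zero spec. */
    }
}
```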

@vlad-penkin can you please open a feature request for this?

vlad-penkin (Contributor) commented:

> @vlad-penkin can you please open a feature request for this?

Done.

@etaf you should receive an email notification about the creation of this feature request.

etaf (Author) commented Aug 9, 2024

> > @vlad-penkin can you please open a feature request for this?
>
> Done.
>
> @etaf you should receive an email notification about the creation of this feature request.

Thanks.

alexbaden (Contributor) commented:

After investigating further, this issue is not applicable to us: our GPUs have a fixed number of registers per thread, and we do not scale a work group's occupancy based on register pressure (unlike CUDA, which dynamically adjusts the registers allocated per thread and scales parallelism down for kernels with high register pressure). Therefore, returning 0 is appropriate here, because we do not want to tell the PyTorch Inductor auto-tuner to tune based on a parameter that has no effect.

Note that with #1654 we now auto-tune the GRF mode when compiling the Triton kernel, recompiling with large-GRF mode in the presence of register spills. I suppose we could consider doing this tuning in PyTorch Inductor instead, but that would require quite a bit more work, and it is not clear what the benefit would be over what we are doing now.
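Conceptually, that retune looks like this (a sketch, not the actual #1654 implementation; it assumes a SPIR-V module in memory and uses the documented -ze-opt-large-register-file build flag; error handling omitted for brevity):

```c
/* Conceptual sketch of the spill-driven GRF retune (not the actual #1654
 * implementation; assumes a valid context, device, SPIR-V binary, and
 * kernel name). */
#include <level_zero/ze_api.h>

ze_module_handle_t build_with_grf_retune(ze_context_handle_t ctx,
                                         ze_device_handle_t dev,
                                         const uint8_t *spv, size_t spv_size,
                                         const char *kernel_name) {
    ze_module_desc_t desc = {0};
    desc.stype = ZE_STRUCTURE_TYPE_MODULE_DESC;
    desc.format = ZE_MODULE_FORMAT_IL_SPIRV;
    desc.pInputModule = spv;
    desc.inputSize = spv_size;
    desc.pBuildFlags = ""; /* first attempt: default GRF mode */

    ze_module_handle_t mod;
    zeModuleCreate(ctx, dev, &desc, &mod, NULL);

    /* Query the kernel's spill size after the first build. */
    ze_kernel_desc_t kdesc = {0};
    kdesc.stype = ZE_STRUCTURE_TYPE_KERNEL_DESC;
    kdesc.pKernelName = kernel_name;
    ze_kernel_handle_t kern;
    zeKernelCreate(mod, &kdesc, &kern);

    ze_kernel_properties_t props = {0};
    props.stype = ZE_STRUCTURE_TYPE_KERNEL_PROPERTIES;
    zeKernelGetProperties(kern, &props);
    uint32_t spill = props.spillMemSize;
    zeKernelDestroy(kern);

    if (spill > 0) {
        /* Spills detected: rebuild the module in large-GRF mode. */
        zeModuleDestroy(mod);
        desc.pBuildFlags = "-ze-opt-large-register-file";
        zeModuleCreate(ctx, dev, &desc, &mod, NULL);
    }
    return mod;
}
```

Doing this inside the Triton backend keeps the retune local to a second zeModuleCreate call, rather than a full re-tuning round trip through Inductor.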

@vlad-penkin I think we can close this ticket.

etaf (Author) commented Sep 19, 2024

> After investigating further, this issue is not applicable to us: our GPUs have a fixed number of registers per thread, and we do not scale a work group's occupancy based on register pressure (unlike CUDA, which dynamically adjusts the registers allocated per thread and scales parallelism down for kernels with high register pressure). Therefore, returning 0 is appropriate here, because we do not want to tell the PyTorch Inductor auto-tuner to tune based on a parameter that has no effect.
>
> Note that with #1654 we now auto-tune the GRF mode when compiling the Triton kernel, recompiling with large-GRF mode in the presence of register spills. I suppose we could consider doing this tuning in PyTorch Inductor instead, but that would require quite a bit more work, and it is not clear what the benefit would be over what we are doing now.
>
> @vlad-penkin I think we can close this ticket.

Thanks @alexbaden, agreed.
And on the Inductor side, besides tuning the GRF mode, we can reduce RBLOCK in the presence of register spills.
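For illustration, that Inductor-side fallback could look like this (a hypothetical sketch; measure_spill_bytes stands in for "recompile the kernel at this RBLOCK and read its spill size" and is not a real API):

```c
/* Hypothetical sketch of shrinking RBLOCK until a candidate config stops
 * spilling. measure_spill_bytes() is a stand-in, not a real API. */
#include <stdio.h>

static unsigned measure_spill_bytes(int rblock) {
    /* Stub for the sketch: pretend configs with RBLOCK > 512 spill. */
    return rblock > 512 ? 1024u : 0u;
}

static int shrink_rblock_on_spills(int rblock, int min_rblock) {
    while (rblock > min_rblock && measure_spill_bytes(rblock) > 0)
        rblock /= 2; /* halve RBLOCK and try again */
    return rblock;
}

int main(void) {
    printf("tuned RBLOCK: %d\n", shrink_rblock_on_spills(2048, 64));
    return 0;
}
```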
