[Pytorch Upstream] Intel Triton is not ready to return n_regs for a compiled_binary. #1641
I attempted to use https://spec.oneapi.io/level-zero/latest/tools/api.html#zetkernelgetprofileinfo to store register use when compiling the kernel, but I have not had success getting it to return that information.
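For reference, a minimal sketch of what such an attempt might look like in C against the Level Zero tools API linked above. The `zet_profile_properties_t` structure and its `numGrfRegisters` field are taken from that spec; this is an illustration only, not the actual Triton driver code, and it assumes the kernel came from a module built with profiling/instrumentation enabled (which may be why the attempt fails otherwise).

```c
#include <level_zero/zet_api.h>
#include <stdio.h>

// Illustrative helper: query GRF register usage for an existing kernel handle.
// Expected to fail or report zero if the module was not built with profiling
// support enabled.
static uint32_t query_num_grf_registers(zet_kernel_handle_t kernel) {
    zet_profile_properties_t props = {0};
    props.stype = ZET_STRUCTURE_TYPE_PROFILE_PROPERTIES;

    ze_result_t status = zetKernelGetProfileInfo(kernel, &props);
    if (status != ZE_RESULT_SUCCESS) {
        fprintf(stderr, "zetKernelGetProfileInfo failed: %d\n", (int)status);
        return 0;
    }
    return props.numGrfRegisters;
}
```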
@EikanWang maybe we should create a feature request to the IGC team and/or the Level Zero team for this feature?
Hi @vlad-penkin, we need this feature for performance optimization. Will the Triton team raise this request to the IGC team?
Yes.
Conceptually this is similar to what Triton (for the XPU target) already does to get the spill size from IGC (see driver.c in the intel third_party directory): we use L0 APIs to retrieve the relevant properties structure, but it does not contain a field for the number of registers used by the kernel. We need the L0 API team to add one. @vlad-penkin can you please open a feature request for this?
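For context, a minimal sketch of the kind of L0 query described above: retrieving `ze_kernel_properties_t` via `zeKernelGetProperties` and reading its `spillMemSize` field. This is an illustration under those assumptions, not the actual code from driver.c, but it shows the gap: the structure exposes spill and memory sizes, yet no register-usage field.

```c
#include <level_zero/ze_api.h>
#include <stdio.h>

// Illustrative only: query the spill size that IGC reports for a compiled
// kernel. ze_kernel_properties_t exposes spillMemSize, privateMemSize, etc.,
// but no field for the number of GRF registers used, which is why n_regs
// cannot be populated from this structure.
static uint32_t query_spill_size(ze_kernel_handle_t kernel) {
    ze_kernel_properties_t props = {0};
    props.stype = ZE_STRUCTURE_TYPE_KERNEL_PROPERTIES;

    if (zeKernelGetProperties(kernel, &props) != ZE_RESULT_SUCCESS) {
        fprintf(stderr, "zeKernelGetProperties failed\n");
        return 0;
    }
    return props.spillMemSize;
}
```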
@etaf you should receive an email notification when this feature request is created.
Thanks.
After investigating further, this issue is not applicable to us: our GPUs have a fixed set of registers per thread, and we do not scale the occupancy of a work group based on register pressure (unlike CUDA, which dynamically adjusts the registers allocated per thread and scales parallelism down for kernels with high register pressure). Therefore, returning 0 is appropriate here, because we do not want to tell the PyTorch Inductor auto-tuner to tune based on a parameter that has no effect.

Note that with #1654 we now auto-tune the GRF mode when compiling the Triton kernel, recompiling with large GRF mode in the presence of register spills. I suppose we could consider doing this tuning in PyTorch Inductor instead, but that would require quite a bit more work and it is not clear what the benefit would be over what we are doing now. @vlad-penkin I think we can close this ticket.
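To make the GRF-mode auto-tuning concrete, here is a hedged sketch in C of the general idea: build the module, check the reported spill size, and rebuild with a large-register-file build flag if spills are present. The `-ze-opt-large-register-file` flag and the surrounding flow are assumptions for illustration; the actual logic lives in the Triton XPU backend (#1654) and may differ.

```c
#include <level_zero/ze_api.h>
#include <stddef.h>

// Illustrative sketch: compile a SPIR-V module with the given build flags.
// Error handling is omitted for brevity; flag names are assumptions.
static ze_module_handle_t build_module(ze_context_handle_t ctx,
                                       ze_device_handle_t dev,
                                       const uint8_t *spirv, size_t size,
                                       const char *flags) {
    ze_module_desc_t desc = {0};
    desc.stype = ZE_STRUCTURE_TYPE_MODULE_DESC;
    desc.format = ZE_MODULE_FORMAT_IL_SPIRV;
    desc.inputSize = size;
    desc.pInputModule = spirv;
    desc.pBuildFlags = flags;

    ze_module_handle_t module = NULL;
    zeModuleCreate(ctx, dev, &desc, &module, NULL);
    return module;
}

// After creating the kernel from the module, check props.spillMemSize
// (see the earlier sketch); if it is non-zero, destroy the module and call
// build_module() again with flags = "-ze-opt-large-register-file".
```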
Thanks @alexbaden, agreed.
Hi, we recently wanted to follow the design of CUDA/ROCm and enable support for Inductor's dynamic_rblock_scaling, but we need to get the number of registers used by a compiled kernel, like this:
https://github.com/pytorch/pytorch/blob/7c45476d38176c8d5b19fb379fc073dc21beba64/torch/_inductor/runtime/triton_heuristics.py#L273-L296
But currently Intel Triton always returns n_regs = 0.
Can we return n_regs the way CUDA does?
reference: ROCm is implementing dynamic_rblock_scaling like CUDA: pytorch/pytorch#129663