
[Pytorch Upstream] Intel Triton is not ready to return n_regs for a compiled_binary. #1641

Open
etaf opened this issue Jul 17, 2024 · 9 comments · May be fixed by #2391

etaf commented Jul 17, 2024

Hi, we recently set out to follow the CUDA/ROCm design and enable support for Inductor's dynamic_rblock_scaling, but for that we need to get the number of registers used by a compiled kernel, as done here:
https://github.com/pytorch/pytorch/blob/7c45476d38176c8d5b19fb379fc073dc21beba64/torch/_inductor/runtime/triton_heuristics.py#L273-L296

But currently Intel Triton always returns n_regs = 0.
Can we return n_regs the way CUDA does?

Reference: ROCm is implementing dynamic_rblock_scaling like CUDA: pytorch/pytorch#129663
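For context, the heuristic's core is an occupancy estimate computed from the compiled binary's register count. A minimal sketch of that arithmetic (simplified from the linked triton_heuristics.py; the function name and device constants below are illustrative, not actual Inductor code):

```c
/* Sketch of the occupancy estimate behind dynamic_rblock_scaling
 * (simplified from the linked triton_heuristics.py; names and the
 * device constants are illustrative, not the actual Inductor code). */
#include <stdio.h>

/* Estimate how many blocks of a compiled kernel can be resident on one SM,
 * given the per-thread register count reported by the compiled binary. */
static int max_blocks_per_sm(int n_regs, int num_warps,
                             int regs_per_multiprocessor) {
    const int threads_per_warp = 32;
    int regs_per_block = n_regs * threads_per_warp * num_warps;
    if (regs_per_block == 0)
        return -1; /* n_regs == 0 (Intel Triton today) makes the estimate
                      meaningless, so the heuristic cannot run. */
    return regs_per_multiprocessor / regs_per_block;
}

int main(void) {
    /* Example: 64 regs/thread, 8 warps, 65536 regs per SM (A100-like). */
    printf("resident blocks: %d\n", max_blocks_per_sm(64, 8, 65536));
    /* When the estimate is too low, Inductor shrinks RBLOCK and retries. */
    return 0;
}
```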

alexbaden self-assigned this Jul 19, 2024
alexbaden (Contributor) commented:

I attempted to use https://spec.oneapi.io/level-zero/latest/tools/api.html#zetkernelgetprofileinfo to retrieve register use when compiling the kernel, but I have not had success getting zeModuleCreate to respect the profile flag (the register count always comes back 0). I also looked at the SYCL APIs for querying registers and was not successful there either (I don't think privateMemory is what we need, but I could not get that to work anyway). However, IGC is clearly able to count allocated registers, since that information shows up in debug logs when compilation fails due to register spills. We likely need to raise this with the IGC team and/or the Level Zero team to see if there is a known configuration that exposes a kernel's allocated registers, or whether we need to file a feature request.
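For reference, the attempted query looks roughly like this (a sketch, not the exact code; it assumes a valid kernel handle, and the structure fields follow the Level Zero tools spec linked above):

```c
/* Sketch of the attempted zetKernelGetProfileInfo query (assumes a valid
 * kernel handle; field names follow the Level Zero tools spec). */
#include <level_zero/zet_api.h>
#include <stdio.h>

void print_profile_info(zet_kernel_handle_t kernel) {
    zet_profile_properties_t props = {0};
    props.stype = ZET_STRUCTURE_TYPE_PROFILE_PROPERTIES;
    /* Note: the module must be built with profiling enabled (a
     * zeModuleCreate build flag) for this query to return useful data. */
    if (zetKernelGetProfileInfo(kernel, &props) == ZE_RESULT_SUCCESS) {
        /* The structure exposes profile flags and a token count,
         * but no per-thread register count. */
        printf("flags: %u, numTokens: %u\n",
               (unsigned)props.flags, (unsigned)props.numTokens);
    }
}
```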

etaf (Author) commented Jul 30, 2024

@EikanWang maybe we should file a feature request with the IGC team and/or the Level Zero team for this?

etaf (Author) commented Aug 6, 2024

Hi @vlad-penkin, we need this feature for performance optimization. Will the Triton team raise this request with the IGC team?

vlad-penkin (Contributor) commented:

> Hi @vlad-penkin, we need this feature for performance optimization. Will the Triton team raise this request with the IGC team?

Yes.

etiotto (Contributor) commented Aug 8, 2024

Conceptually this is similar to what Triton (for the XPU target) already does to get the spill size from IGC. The code for that (see driver.c in the intel third_party directory) is:

[screenshot: the spill-size query in driver.c]

We use L0 APIs to retrieve the ze_kernel_properties_t structure. That structure's declaration is:

[screenshot: the ze_kernel_properties_t declaration from the Level Zero spec]

So it doesn't contain a field for the number of registers used by the kernel. We need the L0 API team to add an n_regs field to that structure. Once the L0 API is updated, the work in Triton to expose n_regs to users (e.g. PyTorch) is pretty straightforward.
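A minimal sketch of that query against the current Level Zero core API (the structure fields are from the spec; note the absence of any register-count field):

```c
/* Sketch of the existing spill-size query via the Level Zero core API
 * (assumes a valid kernel handle). ze_kernel_properties_t, per the current
 * spec, reports local/private/spill memory sizes but has no field for the
 * number of registers the kernel uses. */
#include <level_zero/ze_api.h>
#include <stdio.h>

void print_kernel_mem_props(ze_kernel_handle_t kernel) {
    ze_kernel_properties_t props = {0};
    props.stype = ZE_STRUCTURE_TYPE_KERNEL_PROPERTIES;
    if (zeKernelGetProperties(kernel, &props) == ZE_RESULT_SUCCESS) {
        printf("localMemSize:   %u bytes\n", (unsigned)props.localMemSize);
        printf("privateMemSize: %u bytes\n", (unsigned)props.privateMemSize);
        printf("spillMemSize:   %u bytes\n", (unsigned)props.spillMemSize);
        /* A hypothetical props.n_regs would first have to be added to
         * the Level Zero spec. */
    }
}
```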

@vlad-penkin can you please open a feature request for this?

vlad-penkin (Contributor) commented:

> @vlad-penkin can you please open a feature request for this?

Done.

@etaf you should receive an email notification about the creation of this feature request.

etaf (Author) commented Aug 9, 2024

> > @vlad-penkin can you please open a feature request for this?
>
> Done.
>
> @etaf you should receive an email notification about the creation of this feature request.

Thanks.

alexbaden (Contributor) commented:

After investigating further, this issue is not applicable to us: our GPUs have a fixed number of registers per thread, and we do not scale a work group's occupancy based on register pressure (unlike CUDA, which dynamically adjusts the registers allocated per thread and scales parallelism down for kernels with high register pressure). Therefore, returning 0 is appropriate here, because we do not want to tell the PyTorch Inductor auto-tuner to tune based on a parameter that has no effect.

Note that with #1654 we now auto-tune the GRF mode when compiling the Triton kernel, recompiling with large-GRF mode in the presence of register spills. I suppose we could consider doing this tuning in PyTorch Inductor instead, but that would require quite a bit more work, and it is not clear what the benefit would be over what we are doing now.
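Conceptually, that retune looks like this (a sketch, not the actual #1654 implementation; it assumes a SPIR-V module in memory and uses the documented -ze-opt-large-register-file build flag; error handling omitted for brevity):

```c
/* Conceptual sketch of the spill-driven GRF retune (not the actual #1654
 * implementation; assumes a valid context, device, SPIR-V binary, and
 * kernel name). */
#include <level_zero/ze_api.h>

ze_module_handle_t build_with_grf_retune(ze_context_handle_t ctx,
                                         ze_device_handle_t dev,
                                         const uint8_t *spv, size_t spv_size,
                                         const char *kernel_name) {
    ze_module_desc_t desc = {0};
    desc.stype = ZE_STRUCTURE_TYPE_MODULE_DESC;
    desc.format = ZE_MODULE_FORMAT_IL_SPIRV;
    desc.pInputModule = spv;
    desc.inputSize = spv_size;
    desc.pBuildFlags = ""; /* first attempt: default GRF mode */

    ze_module_handle_t mod;
    zeModuleCreate(ctx, dev, &desc, &mod, NULL);

    /* Query the kernel's spill size after the first build. */
    ze_kernel_desc_t kdesc = {0};
    kdesc.stype = ZE_STRUCTURE_TYPE_KERNEL_DESC;
    kdesc.pKernelName = kernel_name;
    ze_kernel_handle_t kern;
    zeKernelCreate(mod, &kdesc, &kern);

    ze_kernel_properties_t props = {0};
    props.stype = ZE_STRUCTURE_TYPE_KERNEL_PROPERTIES;
    zeKernelGetProperties(kern, &props);
    uint32_t spill = props.spillMemSize;
    zeKernelDestroy(kern);

    if (spill > 0) {
        /* Spills detected: rebuild the module in large-GRF mode. */
        zeModuleDestroy(mod);
        desc.pBuildFlags = "-ze-opt-large-register-file";
        zeModuleCreate(ctx, dev, &desc, &mod, NULL);
    }
    return mod;
}
```

Doing this inside the Triton backend keeps the retune local to a second zeModuleCreate call, rather than a full re-tuning round trip through Inductor.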

@vlad-penkin I think we can close this ticket.

etaf (Author) commented Sep 19, 2024

> After investigating further, this issue is not applicable to us: our GPUs have a fixed number of registers per thread, and we do not scale a work group's occupancy based on register pressure (unlike CUDA, which dynamically adjusts the registers allocated per thread and scales parallelism down for kernels with high register pressure). Therefore, returning 0 is appropriate here, because we do not want to tell the PyTorch Inductor auto-tuner to tune based on a parameter that has no effect.
>
> Note that with #1654 we now auto-tune the GRF mode when compiling the Triton kernel, recompiling with large-GRF mode in the presence of register spills. I suppose we could consider doing this tuning in PyTorch Inductor instead, but that would require quite a bit more work, and it is not clear what the benefit would be over what we are doing now.
>
> @vlad-penkin I think we can close this ticket.

Thanks @alexbaden, agreed.
And on the Inductor side, besides tuning the GRF mode, we can reduce RBLOCK in the presence of register spills.
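For illustration, that Inductor-side fallback could look like this (a hypothetical sketch; measure_spill_bytes stands in for "recompile the kernel at this RBLOCK and read its spill size" and is not a real API):

```c
/* Hypothetical sketch of shrinking RBLOCK until a candidate config stops
 * spilling. measure_spill_bytes() is a stand-in, not a real API. */
#include <stdio.h>

static unsigned measure_spill_bytes(int rblock) {
    /* Stub for the sketch: pretend configs with RBLOCK > 512 spill. */
    return rblock > 512 ? 1024u : 0u;
}

static int shrink_rblock_on_spills(int rblock, int min_rblock) {
    while (rblock > min_rblock && measure_spill_bytes(rblock) > 0)
        rblock /= 2; /* halve RBLOCK and try again */
    return rblock;
}

int main(void) {
    printf("tuned RBLOCK: %d\n", shrink_rblock_on_spills(2048, 64));
    return 0;
}
```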
